Inconsistent Results with AutoCasperNBI

bmarks
Contributor II

Up until recently, I have had flawless results when using AutoCasperNBI to create a NetBoot image for our environment. However, I am now trying to create a Sierra image with 10.12.3 and I am getting really inconsistent results when deploying this image to my 36 imaging servers. I have exhausted all my ideas, so I may need some help.

I created a base image with AutoDMG using 10.12.3 and then I used that base image with the newest version of AutoCasperNBI. On my test Mac mini running Server.app 5.2, it booted every Mac I tried, including the newest TouchBar Macs. However, this NetBoot image does not boot all Macs when I deploy it to other servers. Worst of all, I cannot determine a pattern. Some of our imaging labs will boot all Macs just fine with this new image. One lab wouldn't boot current Mac Pros. One lab wouldn't boot the newest TouchBar Macs but would boot everything else. Some labs wouldn't boot anything.

I'm completely lost at this point. I have never had any issues like this with NetBoot + AutoCasperNBI. Any insight/ideas/completely random thoughts are greatly appreciated.

P.S. - I did make sure to create this NetBoot image on my newest piece of hardware, a 15-inch TouchBar Mac.

39 REPLIES

bentoms
Release Candidate Programs Tester

@bmarks I'd take one lab and isolate its issues before looking at the whole.

bmarks
Contributor II

I tried. But I just can't seem to notice any pattern. I used the exact same settings in AutoCasperNBI, the only differences being my base OS version and Casper Imaging version. I've been focusing my testing on our London lab. At first, it booted nothing. Then, I recopied it and it started working. However, it only worked for about 20 out of 30 NetBoot attempts across both current and previous-generation hardware. These intermittent results seem to be the same at our other labs. It's so confusing. I used ARD to copy it to the Desktop and then manually moved it to the appropriate folder. Nothing I've done has been different than my previous NetBoot image updates over the past couple of years.
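
Since the image started working in London only after it was recopied, one thing worth ruling out is a damaged or incomplete copy on some servers. The following is a minimal sketch, not part of AutoCasperNBI or any tool mentioned in this thread: it hashes every file in an .nbi folder so the result from a known-good server can be compared against each lab's copy. The function names are hypothetical, and how the two result sets get collected from different servers is left open.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks so a
    multi-gigabyte NetBoot.dmg never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def nbi_checksums(nbi_dir):
    """Map each file's path (relative to the .nbi folder) to its digest."""
    root = Path(nbi_dir)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_checksums(reference, candidate):
    """Return relative paths that are missing or differ in the candidate."""
    return sorted(
        name for name in set(reference) | set(candidate)
        if reference.get(name) != candidate.get(name)
    )
```

An empty list from `diff_checksums` means the two copies are byte-identical, which would point the investigation away from the transfer and toward the server or network.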

bentoms
Release Candidate Programs Tester

@bmarks Well, NetBoot has a heap of variables.

What OS NBI's were working before?
What's not working now?
What server is hosting the NBI's?

a_stonham
Contributor II

I have noticed a similar problem. The trick was to NOT reduce the image size when creating the NetBoot image.
Unticking reduce image size seems to have resolved it for me.

powellbc
Contributor II

I have had this issue too. The only change in my workflow was using a 10.12.x based image.

@a.stonham I never check the reduce image size box and I still have had this issue. I will build a new one to test and 100% verify and will report back.

bentoms
Release Candidate Programs Tester

@a.stonham & @powellbc What server is hosting the NBI's? NetSUS?

powellbc
Contributor II

I am using macOS Server 10.12. I had been using an earlier version (I think 10.10 or 10.9) when the problem appeared. I hoped upgrading would solve the issue but no dice.

bentoms
Release Candidate Programs Tester

@powellbc Odd.. I have a number of our customers using AutoCasperNBI &/or AutoImagrNBI 10.12.x NBI's on macOS server 10.9+, & it's happily working.

Once or twice, I have had to change from NFS to HTTP. Might be worth a shot?

These are always reduced size NBI's too.

One other thing, is "Install modified rc.netboot" checked when creating the NBI? I normally leave this checked.

powellbc
Contributor II

@bentoms

I have tried a bunch of things to resolve the issue, including serving over HTTP to no avail. I do check the "Install modified rc.netboot" and as mentioned I do not check reduce image size. Do you recommend I try the settings you described (reduce size, and check "Install modified rc.netboot")?

bentoms
Release Candidate Programs Tester

@powellbc Can you verbose boot a Mac & see if that helps track down where this issue is occurring?

bmarks
Contributor II

For me, I have had the issue on Mac imaging servers running OS X 10.10.5, 10.11.6, macOS 10.12.3, NetSUS version 3 and NetSUS version 4 (I have at least one of each from past testing environments.) I do check off the "Install modified rc.netboot" checkbox. I created one yesterday that isn't reduced in size but I haven't tested it yet. Just to repeat, not counting these test NBI's, the only change I initially made was the base OS from El Capitan to Sierra. I can try and get a verbose login pic too.

bmarks
Contributor II

However, just to repeat as well, it sometimes works on all of the above as well. That's what's so challenging.

dferrara
Contributor II

Have you been able to rule out any funny business with the network? We had an issue that was very difficult to diagnose with our Palo Alto firewall. It seemed to think NetBoot traffic was a packet-based attack and would intermittently prevent Macs from booting.

powellbc
Contributor II

In our case every other boot disk we have available works 100% of the time. Only the 10.12.x ones have issues.

bmarks
Contributor II

We've had some of those types of issues in the past, but they've always been isolated to one imaging lab (I manage 40 imaging labs.) In this case, I'm pretty sure I've ruled out those types of issues, especially since our previous non-Sierra images seem to still work fine.

I've been using AutoCasperNBI basically since it launched and it's a great app. I've never had an issue until now. I guess I shouldn't say that though since this may not be an issue with AutoCasperNBI.

bmarks
Contributor II

I'm not sure what these log entries mean, but my coworker testing this in another lab sent me these log entries that he says are only occurring when a Mac doesn't boot from the NBI:

Mar 22 17:47:36 caspershare-lon2.internal.pretendco.com servermgr_netboot[7036]: updateHTTPSharepoint: default site was nil using default path

Mar 22 17:47:36 caspershare-lon2.internal.pretendco.com servermgr_netboot[7036]: updateHTTPSharepoint: received error from servermgr_web Error Domain=com.apple.servermgrd Code=3 "The operation couldn't be completed. (com.apple.servermgrd error 3.)

As a side note, not shrinking the image doesn't seem to help.

And, I don't know if these things are even related, but since I see "HTTP" in the above logs, I'll mention that we use NFS for our NBI's.
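
To check whether entries like these really line up with failed boots, it can help to pull every servermgr_netboot line out of a server's log and count them per host. This is only a sketch under assumptions: it expects the classic syslog layout seen in the two entries above, the function names are hypothetical, and where each server keeps the relevant log file is not specified here.

```python
import re
from collections import Counter

# Matches classic syslog lines such as:
#   Mar 22 17:47:36 host.example.com servermgr_netboot[7036]: message text
SYSLOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<process>[^\s\[]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<message>.*)$"
)


def netboot_entries(lines, process="servermgr_netboot"):
    """Yield (timestamp, host, message) for lines logged by `process`."""
    for line in lines:
        match = SYSLOG_LINE.match(line.strip())
        if match and match.group("process") == process:
            yield match.group("stamp"), match.group("host"), match.group("message")


def count_by_host(lines, needle="updateHTTPSharepoint"):
    """Count matching entries per host whose message contains `needle`."""
    return Counter(
        host for _, host, message in netboot_entries(lines) if needle in message
    )
```

Passing an open log file (or a list of lines) to `count_by_host` gives a per-server tally that can be set against the timestamps of the failed NetBoot attempts.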

powellbc
Contributor II

Verbose boot showed tons of errors saying "Caller not allowed to perform action: smd:209 action = service removal, code =150: Operation not permitted while System Integrity Protection is engaged"

Above that, I saw:
"error reading http code, returning kIOReturnInternalError"
...
_peerManager is missing

At this point it seems to be repeating the top error over and over, and boot never completes.

bmarks
Contributor II

I was just sent similar logs. Why would SIP be triggered though? It's not relevant when you're NetBooting from the Boot Picker, right?

bentoms
Release Candidate Programs Tester

@powellbc & @bmarks NetBoot Images have SIP enabled, (Apple keeps it enabled so ACNBI does too).. NetInstall do not.

What are the permissions on the NBI's? The folder & contents

bmarks
Contributor II

Here are the permissions for the NBI and the contents of the NBI. FYI, just to be clear, "macinstaller" is our local admin user on these Mac imaging servers. Do you want me to go deeper?

drwxrwxr-x  5 root  admin   170B Mar 22 12:21 macOS_Imaging_All_Macs_V3.nbi

-rw-r--r--  1 macinstaller  staff   2.8K Mar 21 13:32 NBImageInfo.plist
-rw-rw-r--  1 root          admin   8.5G Mar 21 13:33 NetBoot.dmg
drwxr-xr-x  5 macinstaller  staff   170B Mar 22 12:21 i386

bentoms
Release Candidate Programs Tester

@bmarks Can you try root:admin throughout the NBI, please
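
Before and after changing ownership, it may be useful to list exactly which entries are not root:admin, since the listing above shows a mix of root and macinstaller. The sketch below is an illustration only, with hypothetical function names; it reports mismatches and changes nothing on disk.

```python
import grp
import os
import pwd


def owner_name(uid):
    """Resolve a numeric uid to a user name, falling back to the number."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid):
    """Resolve a numeric gid to a group name, falling back to the number."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def ownership_mismatches(nbi_dir, owner="root", group="admin"):
    """List (path, owner, group) for every entry not owned by owner:group.

    Covers the .nbi folder itself plus everything beneath it.
    """
    paths = [nbi_dir]
    for dirpath, dirnames, filenames in os.walk(nbi_dir):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    mismatches = []
    for path in paths:
        info = os.lstat(path)  # lstat: report on symlinks, not their targets
        found = (owner_name(info.st_uid), group_name(info.st_gid))
        if found != (owner, group):
            mismatches.append((path, *found))
    return mismatches
```

An empty result means the whole NBI is already owned as requested, so any remaining boot failures are unlikely to come from ownership.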

bmarks
Contributor II

Testing now. This may take a little while.

bmarks
Contributor II

This does not appear to make any difference.

bentoms
Release Candidate Programs Tester

@bmarks ok.. does server.app show HTTP being used or NFS? I know you mentioned you selected NFS, but the logs seem to show HTTP..

Whatever it is, try the other.
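
As a cross-check against what Server.app displays, the NBImageInfo.plist inside the .nbi folder can be read directly. This is a hedged sketch: the key names ("Name", "Type", "IsEnabled", "IsDefault") are assumptions based on typical NBImageInfo.plist files, the function name is hypothetical, and the plist may not reflect what the server is actually serving at that moment.

```python
import plistlib
from pathlib import Path


def nbi_settings(nbi_dir, keys=("Name", "Type", "IsEnabled", "IsDefault")):
    """Return selected keys from the NBImageInfo.plist inside an .nbi folder.

    Keys absent from the plist come back as None rather than raising.
    """
    plist_path = Path(nbi_dir) / "NBImageInfo.plist"
    with open(plist_path, "rb") as handle:
        info = plistlib.load(handle)
    return {key: info.get(key) for key in keys}
```

Running this against the same NBI on a working server and a failing one would show whether the protocol setting differs between them.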

bmarks
Contributor II

I just checked to be certain and it is definitely set to NFS.

I'll try the other option now.

bmarks
Contributor II

Actually, I take back part of what I just typed. When testing a current-model 15-inch TouchBar MacBook Pro, there is a pause after the globe. The progress bar gets stuck and appears to be making zero progress. I waited a few minutes on my previous attempt and thought it had failed, which is when I replied. However, I booted again into verbose mode and noted that it appears to pause at this line in the verbose mode logs: Extension SDK cache is not present. Attempting to rebuild...

However, I waited longer this time and even though the progress bar never makes progress, it does eventually boot. This is the case over NFS or HTTP. Now, since the results are intermittent I can't be sure this is a 100% fix. I had to start testing on my own because my coworker in London went home for the evening. I may need to wait until tomorrow to have him further verify whether changing the permissions helps.

bentoms
Release Candidate Programs Tester

@bmarks Cool. I'm NW of London, so should not be computering now either :)

Once or twice, I have had to change from NFS to HTTP. Might be worth a shot?

So, might be a better default/recommendation now

bmarks
Contributor II

I did see in the verbose mode logs something along the lines of "System Integrity Protection is engaged." Other Macs I am testing here at my desk are working so far. However, I may not be the best test candidate because part of the issue was that I didn't have any of these issues personally before deploying the NBI.

bentoms
Release Candidate Programs Tester

@bmarks that's normal.

TBH, I've tried to stay as true to Apple's NetBoot creation scripts as possible. So the SIP message is normal.

powellbc
Contributor II

I was mistaken. In the troubleshooting process I had set the 10.12 boot disks to boot over HTTP. In any case, it does not work reliably with that or NFS. :(

bmarks
Contributor II

Sorry for the delay, I had to wait for someone in a different time zone. It does not appear that chown'ing the NBI makes a difference.

bmarks
Contributor II

These don't feel like issues that'll be fixed by an OS update, but I am going to create a new NBI with 10.12.4 today and I'll post the results here.

bmarks
Contributor II

I saw this a while ago, but didn't mention it. I just saw it again though. The first time I booted this new NBI, I saw a "Completing Installation: X minutes remaining" message. I think it only happens once. My base OS was created with AutoDMG. I don't know if this means anything, but I figured I'd mention it this time.

bentoms
Release Candidate Programs Tester

@bmarks odd..

But please let me know how 10.12.4 goes.

If it works locally, but not in the remote location.. is it possible to build an NBI at the remote location?

bmarks
Contributor II

For me in Portland, I think there is some improvement. Last week, I was unable to boot a 2012 MacBook Air (that was my only failure last week.) This week with 10.12.4 I can. More of the failures, though, were in our London lab and I will definitely post those results tomorrow.

Additionally, I did just see the "Completing Installation: X minutes remaining" message a second time. I tested four Macs here at my desk, and the two TouchBar models were the ones that displayed that message. What I didn't do is chown the NBI like before, so I will see if that makes any difference.

bmarks
Contributor II

After chown'ing the NBI, I'm two for two not seeing that "Completing Installation: X minutes remaining" message after I was previously two for two seeing it on different TouchBar Macs. Maybe just an anomaly.

bentoms
Release Candidate Programs Tester

@bmarks oh.. TBP.. hmm.. maybe that message is due to this?

powellbc
Contributor II

I meant to post this earlier, but this is now working for us.

The boot disk is based on 10.12.4 and we checked reduce image size as well as "Install modified rc.netboot". In the past we never checked reduce image size, so perhaps that was it, or the 10.12.4 release.

bentoms
Release Candidate Programs Tester

@powellbc Awesome news