mind the explanatory gap

many a slip ‘twixt mind and lip…


AFP. It ain't so bad.


From looking at the mailing lists and discussion boards, you’ll see that a reasonable number of people seem to be having problems with their AFP servers under OS X Server 10.4.x, particularly under heavy load with network home directories, and particularly in terms of stability. I used to be one of those people, but the solutions presented in this article have (touch wood) resolved the vast majority of my issues.

Whereas under 10.3 I used to get a symptom where the whole server would lock up and not even be pingable, the symptoms I’ve seen under 10.4.x have been entirely related to the AFP server. Login times will get longer and longer, access speed becomes slower and slower, and eventually the AFP server enters a death spiral, where the only recourse seems to be restarting the service.

I’m going to quickly cover some of the more commonly known fixes and workarounds that I’d tried before these major solutions.

1) Make sure DNS is working with proper forward and reverse entries for your servers at the bare minimum, and preferably all your client machines as well.

2) Redirect ~/Library/Caches to a local folder for network home directory users.
This makes quite a big difference. If you start watching your AFP server logs, you’ll notice that your network home directory users hit the cache a lot, particularly for apps like Safari. We do this with a login hook that redirects ~/Library/Caches to /Library/Caches/username/.
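A minimal sketch of what such a login hook could look like, assuming the hook runs as root and receives the user's short name as its first argument. The function name and the way the home directory path is passed in are my own illustration rather than the hook we actually run, so adapt it to however your homes are mounted.

```shell
#!/bin/sh
# redirect_caches USER HOME_DIR LOCAL_ROOT
# Hypothetical helper: replaces HOME_DIR/Library/Caches with a symlink
# to LOCAL_ROOT/USER so cache traffic stays on the local disk.
redirect_caches() {
    user="$1"
    home="$2"
    local_root="$3"
    target="$local_root/$user"
    link="$home/Library/Caches"

    mkdir -p "$target" "$home/Library" || return 1

    # Already a symlink from a previous login: nothing to do.
    if [ -L "$link" ]; then
        return 0
    fi

    # Throw away the network-side cache folder; caches are disposable.
    rm -rf "$link"
    ln -s "$target" "$link"
}

# In a real hook you would call something like the following, and also
# chown the local folder to the user, since the hook runs as root:
# redirect_caches "$1" "/path/to/network/home/$1" /Library/Caches
```

The early return on an existing symlink keeps the hook cheap on repeat logins, which matters when it runs for every user on every login.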

3) Disable creation of .DS_Store files for network mounts.
There is an Apple techinfo article up about this. From debugging, it looks like a corrupt .DS_Store file can create all sorts of problems, particularly if you have a setup like mine, where students have read-only access to certain network folders, and lecturers have write access. I’ve used MCX to push out that setting, and then run a command to delete all existing .DS_Store files on the sharepoint. The problem with this solution is that your users will get a bit narky about no longer being able to have window settings ’stick’ across sessions.
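For the cleanup step, something along these lines should do the job. This is a sketch built on plain find, not necessarily the exact command I ran, and the function name is just for illustration. Point it at the top of your sharepoint.

```shell
#!/bin/sh
# purge_ds_store DIR
# Prints and deletes every regular file named .DS_Store under DIR.
purge_ds_store() {
    find "$1" -type f -name '.DS_Store' -print -exec rm -f {} \;
}

# Example (the path is a placeholder, substitute your own sharepoint):
# purge_ds_store /path/to/sharepoint
```

The -print is there so you get a list of what was removed, which is handy the first time you run it across a large sharepoint.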

4) Check the integrity of your filesystem.

5) Put more RAM in your server.
I was previously looking at swapfile creation and vm_stat to work out whether my AFP servers needed more RAM. The problem seems to be that the AFP server tries to avoid swapping to disk, so these aren’t good metrics to work from. As soon as I stuck a couple more gig into my servers, they started using it…

6) Move heavy users to Mobile Accounts with Portable Home Directories.
This isn’t applicable in all situations, and there’s no way I could move all my students to Mobile Accounts, but for my staff members who use Office and Mail heavily, performance on a network home directory isn’t that great, even on a fast connection. If you have users who primarily use a single machine, Mobile Accounts where you control the synchronisation settings via MCX give you the best of both worlds.

7) Increase the sysctl parameters for max files and max files per process.
This is a more general tuning tip for OS X Server, but as Rob Middleton pointed out to me, without it, if you have AFP error logging on, it helpfully logs several hundred lines per second to inform you that it can’t open files… filling up your primary partition rather quickly…
Create /etc/sysctl.conf with the following parameters:

kern.maxfiles=200000

kern.maxfilesperproc=50000


If you want to apply those settings immediately, rather than waiting for a reboot, you can run:

for i in $(cat /etc/sysctl.conf); do sysctl -w $i; done

Now onto the more interesting fixes…

1) Tweaking the WAN threshold and packet size on the clients

If you run this command on a client machine, you can see various AFP client tuning parameters. Some of your settings may be slightly different to these.

  nigelkersten@zombie: ~ $ defaults read -g com.apple.AppleShareClientCore
  {
      "afp_active_timeout" = 0;
      "afp_authtype_show" = 0;
      "afp_cleartext_allow" = 1;
      "afp_cleartext_warn" = 1;
      "afp_debug_level" = 6;
      "afp_debug_syslog" = 0;
      "afp_default_name" = "";
      "afp_idle_timeout" = 0;
      "afp_keychain_add" = 1;
      "afp_keychain_search" = 1;
      "afp_login_displayGreeting" = 1;
      "afp_maxDirCache" = 60;
      "afp_maxFileCache" = 60;
      "afp_minDirCache" = 5;
      "afp_minFileCache" = 5;
      "afp_mount_defaultFlags" = 0;
      "afp_no_kQueues" = 0;
      "afp_no_volChange_caching" = 1;
      "afp_prefs_version" = 2;
      "afp_reconnect_allow" = 1;
      "afp_reconnect_interval" = 10;
      "afp_reconnect_retries" = 12;
      "afp_ssh_allow" = 0;
      "afp_ssh_force" = 0;
      "afp_ssh_require" = 0;
      "afp_ssh_warn" = 1;
      "afp_use_default_name" = 0;
      "afp_use_short_name" = 0;
      "afp_voldlog_skipIfOnly" = 0;
      "afp_wan_quantum" = 8192;
      "afp_wan_threshold" = 30;
  }
  

Of particular interest are these two settings, ‘afp_wan_quantum’ and ‘afp_wan_threshold’. The values I’ve shown here should be the defaults, but you may have a setting of 0 for both of them. These are used by the client to work out whether a particular AFP connection is over a LAN or WAN connection. If the latency of a given connection is higher than the value in afp_wan_threshold, then the data chunk size drops from 128KiB (the default for a LAN connection) to 8KiB, the setting shown here.

The problem seems to be that this default threshold setting is way too low, and once the AFP server starts experiencing moderate load, your LAN clients start using the WAN data chunk size. Although smaller chunks are desirable for slow connections, they induce an overhead on the server in terms of processing as the server is dealing with 16x the number of chunks, and reduce overall throughput. A symptom of this is high CPU usage.
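To put a number on that 16x, here is a quick back-of-the-envelope calculation in shell. The 10 MiB transfer size is just an illustrative figure I picked, and the function name is my own.

```shell
#!/bin/sh
# chunk_count BYTES QUANTUM
# Number of chunks needed to move BYTES at a given chunk size, rounding up.
chunk_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

size=$((10 * 1024 * 1024))          # a 10 MiB transfer, for illustration
lan=$(chunk_count "$size" 131072)   # 128 KiB LAN chunk size
wan=$(chunk_count "$size" 8192)     # 8 KiB WAN chunk size

echo "LAN chunks: $lan, WAN chunks: $wan, ratio: $((wan / lan))x"
# prints "LAN chunks: 80, WAN chunks: 1280, ratio: 16x"
```

Every one of those extra chunks is a request the server has to process, which is where the CPU overhead comes from.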

The good news is that we can change these settings. The values you choose are up to you, and depend largely upon whether you actually do have WAN clients to support, so I’m going to suggest two scenarios:

First scenario: No WAN clients (’cause AFP sucks over slow connections anyway :) )

    afp_wan_quantum = 131072
    afp_wan_threshold = 1000
    

This is a bit of a shotgun approach. Bring the latency threshold way up, and even if the clients reach this threshold, force their chunk size to be the same as that of a LAN client. This is what I’ve done in my environment, mainly because I wanted to make sure that I could rule out this issue. As time goes on, and once I work out the correct way to measure latency for a client (see the footnote at the end), I’ll come up with more sane values, but at this stage, I just needed to stop my lecturers from revolting en masse due to AFP stability issues.

Second scenario: Some WAN clients who use AFP (I feel their pain…)

  afp_wan_quantum = 8192
  afp_wan_threshold = 200
  

This is very much a guesstimate on my part. I’ve done a few tests at a threshold of 100, and found that LAN clients were still getting the WAN chunk size. Again, see the footnote at the end, as I’d like to find a way of getting an accurate picture of the latency a client is experiencing.

So as far as this setting goes, the good news (for me) is that it has almost entirely resolved performance issues with the AFP server under moderate load. I’ve also been experimenting with the maxFileCache and maxDirCache settings, and even when I’ve been increasing them from 60 to 6000, I’ve been unable to replicate any stale cache issues. If you so desire, try playing around with those settings.

Applying these settings to a client machine.

Well, there are a couple of options you have here.

1) Apply the settings manually.

One way would be to use a LoginHook to set these parameters using the defaults command, i.e. the first scenario above could be done with the following two commands:

    defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_quantum -int 131072
    defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_threshold -int 1000
  

The “-g” refers to the fact that we’re writing these settings into the Global Domain (for a user, at ~/Library/Preferences/.GlobalPreferences.plist), -dict-add means you’re adding a key and value to a dictionary (named “com.apple.AppleShareClientCore”), and -int means you’re adding an integer value.
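If you’re going to script this, it’s worth guarding against mistyped values. The sketch below only prints the two defaults commands so you can eyeball them before piping them to sh. The power-of-two check on the quantum is my own assumption (both 8192 and 131072 happen to be powers of two), not a documented AFP requirement, and the function names are made up for this example.

```shell
#!/bin/sh
# is_power_of_two N
# Succeeds if N is a positive integer power of two.
is_power_of_two() {
    [ "$1" -gt 0 ] 2>/dev/null && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# afp_wan_cmds QUANTUM THRESHOLD
# Prints (does not run) the defaults commands for the given values.
afp_wan_cmds() {
    if ! is_power_of_two "$1"; then
        echo "refusing: quantum $1 is not a power of two" >&2
        return 1
    fi
    echo "defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_quantum -int $1"
    echo "defaults write -g com.apple.AppleShareClientCore -dict-add afp_wan_threshold -int $2"
}

# Review the output first, then apply it on the client:
# afp_wan_cmds 131072 1000 | sh
```

Printing rather than executing keeps the script harmless on a machine where you only want to check what would be written.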

This may be the best solution for your environment, and is a good place to start testing before you move onto the next method…

2) Use MCX to manage the settings.

This is the way I’ve done it, as it has some nice side effects, and is more The Apple Way ™. As with all such settings, you can choose to do this at the user or group level. I’ve chosen to do it to my two main groups that cover all my staff and all my students.

  • Open up Workgroup Manager, and choose the group or user you wish to apply these settings to. Click on the toolbar item ‘Preferences’ and then the tab ‘Details’ to the right.
  • Click on the ‘Add’ button, and navigate to your own home folder. Choose “.GlobalPreferences.plist”. You’ll now see it appear in the Preferences pane.
  • Double click on the .GlobalPreferences item to edit it. You’ll notice that the settings have been put into ‘Often’ rather than ‘Always’ or ‘Once’. I’ve had some really odd behaviour with trying to get the global defaults domain to be managed ‘Always’, and ‘Often’ has been working happily for me, so let’s leave it set that way.
  • If you have any items other than the ‘com.apple.AppleShareClientCore’ dictionary, delete them, unless of course you wish to manage those settings for your users as well. Click on the triangle next to the AppleShareClientCore dictionary to expand it.
  • Here you can see the relevant settings. Change the afp_wan_threshold and afp_wan_quantum values to the ones you’ve decided to use. Click on ‘Apply Now’.

Done! A really nice side effect of doing things this way is that you can now centrally manage a bunch of other afp connection settings, and perhaps most usefully, you can turn on AFP client side debugging using MCX for all your users. The AFP server itself isn’t particularly useful at giving debugging info, but the client is actually quite good. The footnote at the bottom will explain how to turn on client-side debugging.

Since I applied these changes, I’ve seen a drastic change in stability and performance under load. However… resolving this issue has pointed to another problem with Apple’s default settings, and I’ve noticed that the server is still slowing down under much heavier load, and still not using all the resources available to it.

There are a few oddities here though. It seems like global preferences shouldn’t take effect until the home directory is actually mounted, but perhaps by putting the prefs into MCX, we’re having them take effect before the actual home directory has mounted… ?

If anyone goes down the manual route, I’d like to hear whether you find these settings aren’t taking effect.

2) Tweaking the maximum # of threads on the server

The AFP server saw a change in 10.4, where permissions are now calculated by spawning a new thread for each connection with the effective uid of the connecting user. The problem is….

    root@server: ~ $ serveradmin settings afp:maxThreads
    afp:maxThreads = 40
  

That seems kind of low given the above, right? Apparently we should have at least one thread per client session, and this would include all your automounts…

I’ve set mine to 600 here like this:

    sudo serveradmin settings afp:maxThreads=600
  

This is something that again will depend upon what else your AFP server is doing (ideally not much…) and how many clients and automounts you have.
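As a rough way of picking a number, you could estimate one thread per session and add some headroom. The formula and the figures below are purely my own illustration of the "one thread per client session, automounts included" rule of thumb, not anything Apple documents.

```shell
#!/bin/sh
# estimate_max_threads CLIENTS MOUNTS_PER_CLIENT HEADROOM_PERCENT
# Sessions = clients x AFP mounts per client (home directory plus automounts),
# then padded by a percentage of headroom.
estimate_max_threads() {
    sessions=$(( $1 * $2 ))
    echo $(( sessions + sessions * $3 / 100 ))
}

# Hypothetical site: 100 clients, each with 4 AFP mounts, 25% headroom.
estimate_max_threads 100 4 25
# prints 500
```

You would then feed the result to serveradmin settings afp:maxThreads as shown above, and keep an eye on the server under load to see whether the estimate holds up.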

Footnote: Working out the latency of a client connection and AFP client side debugging.

You may have noticed two debug settings in the AppleShareClientCore dictionary as you were applying those settings.

    afp_debug_level = 6
    afp_debug_syslog = 0
  

If you set afp_debug_syslog to 1 (true), and add the following line to /etc/syslogd.conf,

  *.debug                                                 /var/log/debug.log
  

then you’ll see a wealth of debugging info (depending upon the value from 1 to 8 you’ve set for afp_debug_level) go to /var/log/debug.log.

This isn’t something you should just willy-nilly turn on for all your clients, as tempting as it may be to finally have debug info for AFP… Rather use it to have a look at what goes on when a client connects to an AFP mount.

You’ll notice if you have the debug level up quite high that you’ll see a bunch of times reported to the debug log. I have yet to work out how to interpret those times to get the latency of a given client connection, and this is something I’d like to know so that I can tune ‘real’ values for my afp_wan_threshold.

So go forth! and work out how to do it. :)

As always, I’m not responsible for your server spontaneously combusting, your mileage may vary and this article may contain traces of nuts.

Many thanks must go to the afp548.com posse, Joel n Josh for extensively talking through these issues with me, as well as bringing them to my attention…
