I’d been annoyed about a problem when connecting to my NexentaStor appliances over ssh, in that always the initial connection would take a good 20 seconds before I got the password prompt.
Subsequent logins would then run in a reasonable response time, so once that first login of the day had been completed this wouldn’t be an issue … but still, that initial delay really started to bother me.
The traditional way to start debugging such issues is to use the -v switch when invoking ssh and so that’s what I did and discovered we appeared to be waiting here:
debug1: Remote protocol version 2.0, remote software version Sun_SSH_1.5 debug1: no match: Sun_SSH_1.5 debug1: Authenticating to 10.0.0.1:22 as 'root' debug1: SSH2_MSG_KEXINIT sent
Searching Google for SSH_MSG_KEXINIT and slow ssh logins revealed a number of hits, all of which were wrong for this particular problem. I tried most of the ideas, including:
- Removing mdns from the /etc/nsswitch.conf file for the hosts: entry
- Ensuring client and server could resolve each other in the local /etc/hosts file
- Reviewing Illumos Bug #1983 (sshd problems)
- Disabling and unloading the crypto pkcs11_tpm.so module
- Attempting to prevent reverse DNS lookups
Having exhausted this list and reviewing a bunch of non-Solaris/Illumos based ssh issues, I went back to basics and decided to truss sshd on the appliance after rebooting the appliance and then logging into the console.
# truss -o /tmp/sshd.truss -Dealf -rall -vall -wall -p `pgrep sshd`
Then I attempted the initial ssh from my Mac, got the delay, logged in and then stopped the truss so I could review it.
The truss was huge, much bigger than I’d anticipated, which meant analysing it was going to take much longer than I’d planned.
# ls -lh /tmp/sshd.truss -rw-r--r-- 1 root root 137M Dec 7 13:21 /tmp/sshd.truss
Frustratingly there were no traces of the ‘SSH2_MSG_KEXINIT’ string (nor any KEX related string), which is probably because these messages are client side, rather than server side debug data.
Using the -D switch turns on time stamp deltas, useful for when you’re searching for a long pause between functions but alas where there were interesting delays between functions, these were uneventful forkx() and vforkx() calls:
# awk '$2 ~ /[0-9]*\.[0-9][0-9][0-9][0-9]/ {print $0}' /tmp/sshd.truss | sort -n -k2 | tail 1305/1: 45.9956 forkx() (returning as child ...) = 1272 1307/1: 46.0128 forkx() (returning as child ...) = 1305 1309/1: 46.0365 vforkx() (returning as child ...) = 1305 1311/1: 46.0539 forkx() (returning as child ...) = 1272 1313/1: 46.6702 forkx() (returning as child ...) = 1272 1315/1: 46.7773 forkx() (returning as child ...) = 1272 1317/1: 46.8037 forkx() (returning as child ...) = 1272 1319/1: 62.3412 forkx() (returning as child ...) = 1272 1321/1: 62.3736 forkx() (returning as child ...) = 1319 1323/1: 62.3934 forkx() (returning as child ...) = 1319
Still as I worked through the code I couldn’t help notice that we seemed to be doing an awful lot of work reading locale files:
1261/1: 0.0001 open("/usr/lib/locale//en_US.UTF-8/LC_CTYPE/LCL_DATA", O_RDONLY) = 5 ... 1261/1: 0.0002 read(5, 0x0807BC9C, 95232) = 94904 1261/1: R u n e M a g 1 U T F - 8\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 1261/1: \0\0\0\0\0\0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 1261/1: \0\0\0 \0\0\0 \0\0\0 h\0\0\0 (\0\0\0 (\0\0\0 (\0\0\0 (\0\0\0 1261/1: \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 1261/1: \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 1261/1: \0\0\0 \0\0\0 H80\0\010A0\0\010A0\0\010A0\0\010A0\0\010A0\0\0 ... 1261/1: A3A7\0\0A2A7\0\0A5A7\0\0A5A7\0\0A4A7\0\0A7A7\0\0A7A7\0\0A6A7\0\0 1261/1: A9A7\0\0A9A7\0\0A8A7\0\0 AFF\0\0 ZFF\0\0 !FF\0\0 1261/1: 0.0054 brk(0x080949D0) = 0 1261/1: 0.0001 brk(0x080B29D0) = 0 1261/1: 0.0005 llseek(5, 0, SEEK_CUR) = 94904 1261/1: 0.0003 close(5) = 0 1261/1: 0.0003 open("/usr/lib/locale//en_US.UTF-8/LC_NUMERIC/LCL_DATA", O_RDONLY) = 5 . . . 1261/1: 0.0941 open("/usr/lib/locale//el_CY.UTF-8/LC_CTYPE/LCL_DATA", O_RDONLY) = 7 ... 1261/1: 0.0074 read(7, 0x080B1AFC, 95232) = 94904 1261/1: R u n e M a g 1 U T F - 8\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 1261/1: \0\0\0\0\0\0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 1261/1: \0\0\0 \0\0\0 \0\0\0 h\0\0\0 (\0\0\0 (\0\0\0 (\0\0\0 (\0\0\0 1261/1: \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 1261/1: \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 \0\0\0 1261/1: \0\0\0 \0\0\0 H80\0\010A0\0\010A0\0\010A0\0\010A0\0\010A0\0\0 1261/1: 10A0\0\010A0\0\010A0\0\010A0\0\010A0\0\010A0\0\010A0\0\010A0\0\0 1261/1: 10A0\0\010A0\0\084A0\0\084A0\0\084A0\0\084A0\0\084A0\0\084A0\0\0 1261/1: 84A0\0\084A0\0\084A0\0\084A0\0\010A0\0\010A0\0\010A0\0\010A0\0\0 ...
In addition whilst the ssh to the server had been running I also happened to notice that the disk access light was flashing quite heavily.
So just how many locales are installed on this system?
# cd /usr/lib/locale # ls -1 | wc -l 223 # locale -a C POSIX af_ZA.UTF-8 ar_AE.UTF-8 ar_BH.UTF-8 ar_DZ.UTF-8 ar_EG.UTF-8 ar_IQ.UTF-8 ar_JO.UTF-8 . . . zh_MO.UTF-8 zh_SG.UTF-8 zh_TW.UTF-8 # find . -type f -print | wc -l 1342 # grep 'open.*/usr/lib/locale' /tmp/sshd.truss | wc -l 2787 # dpkg -l | grep -i locale ii library-perl-5-encode-lo 40-0-3 Determine the locale encoding ii library-perl-5-gettext-l 40-0-0 Gettext-Locale PERL module ii locale-af 40-3-3 Afrikaans language support ii locale-ar 40-3-3 Arabic language support ii locale-as 40-3-3 Assamese language support ii locale-az 40-3-3 Azerbaijani language support ii locale-be 40-3-3 Belarusian language support ii locale-bg 40-3-3 Bulgarian language support ii locale-bg-extra 40-3-3 Bulgarian language support extra files ii locale-bn 40-3-3 Bengali language support . . . ii locale-zh-sg 40-3-3 Singapore Chinese language support ii locale-zh-tw 40-3-3 Traditional Chinese language support ii system-library-iconv-utf 1.0.2 Iconv modules for UTF-8 Locale ii text-locale 40-3-3 System Localization
Quite a few it turns out … could these be the cause of the initial slow down as we have to read them from disk and subsequently the data is cached in the ARC?
Whilst it’s a good thing there are so many locales available, for my particular requirements I really don’t need so many, so I set about removing ones that I’ll never need.
# dpkg -f locale-vi locale-ur locale-uk locale-ug locale-tr-extra locale-tr locale-th-extra locale-th (repeat for all non-required locale packages)
After completing this then rebooting the appliance, an ssh to this system completed much much quicker taking just a couple of seconds.