Tales from STMF land
WARNING
The following description related to a NexentaStor Enterprise system under a support contract, where the customer was entitled to log a support call with Nexenta.
If you are reading this because you have encountered a problem with your NexentaStor iSCSI/STMF service, then you are very welcome to read this document however you are STRONGLY encouraged to log a support call first!
A lot of the details within are carried out below the supported NMV/NMC interfaces and any commands run at the bash level may invalidate your warranty.
Furthermore it is very possible to completely break your STMF configuration, which would require restoring from a backup (if available) or in extreme cases, having to rebuild the ENTIRE configuration by hand.
When STMF goes wrong
The other day I took an escalation for a customer’s system that had not correctly shared out their iSCSI logical units (LUs) during a cluster failover. After this was failed back to the primary node and the problem views restored, I needed to figure out what had caused the problem.
A NexentaStor collector bundle was gathered from each node and I started to dig through the logs trying to work out the sequence of events.
It soon became clear that the High Availability RSF-1 cluster code had run into an error:
Operational Status: offline Config Status : uninitialized ALUA Status : enabled ALUA Node : 1
[3037 Feb 12 02:13:24] [pool1 S20zfs] Running map manager restore [3654 Feb 12 02:13:24] [ESC_ZFS_config_sync rsf-zfs-event] Updating saved cache file: /opt/HAC/RSF-1/etc/volume-cache/asgard.cache-live ==> /opt/HAC/RSF-1/etc/volume-cache/asgard.cache
WARNING: Failed to import LU /dev/zvol/rdsk/asgard/loki: STMF_ERROR_VE_CONFLICT: Adding this view entry is in conflict with one or more existing view entries, errno (9) Bad file number
- The STMF state is offline
- The STMF configuration is uninitialised
- When trying to add a view, we get back error STMF_ERROR_VE_CONFLICT
The error code means that we’re trying to add a view to an LU and that a view with the same details already exists, such that we would have a clashing/duplicate entry.
Now in ALUA mode (used with fibre channel) this isn’t so much of a problem because our hosts are either in active or standby mode, so we need to have views ready to go.
The bigger problem here is the fact that not only is STMF offline but we somehow failed to load the configuration.
Looking at the state of the STMF service and then the log file, it was clear something was wrong with the service:
bash# svcs -l svc:/system/stmf:default maintenance 3:02:23 svc:/system/stmf:default bash# less /var/svc/log/system-stmf:default.log ... [ Feb 12 02:08:04 Executing start method ("/lib/svc/method/svc-stmf start"). ] svc-stmf: Unable to load the configuration. See /var/adm/messages for details svc-stmf: For information on reverting the stmf:default instance to a previously running configuration see the man page for svccfg(1M) svc-stmf: After reverting the instance you must clear the service maintenance state. See the man page for svcadm(1M) [ Feb 12 02:08:04 Method "start" exited with status 1. ]
The logs tell us to review the messages file but alas there were again no STMF related errors here either.
What’s required at this point is to augment the STMF service trying to start, so we can determine what error code is coming back. This is fairly easy to do using truss and modifying the start method for this service:
bash# svccfg -s svc:/system/stmf svc:/system/stmf> setprop start/exec = "/usr/bin/truss -ulibstmf -ealfo /tmp/svc-stmf.truss -rall -vall -wall /lib/svc/method/svc-stmf start" svc:/system/stmf> quit bash# svcadm refresh svc:/system/stmf:default
bash# tail -f /var/svc/log/system-stmf:default.log &
bash# svcadm clear svc:/system/stmf:default
bash# less /tmp/svc-stmf.truss . . .
16041/1@1: -> libstmf:stmfLoadConfig()
16041/1: open("/etc/svc/volatile/repository_door", O_RDONLY) = 3
16041/1: getpid() = 16041 [16040]
16041/1: door_call(3, 0x0803B240) = 0
16041/1: data_ptr=803B260 data_size=4
16041/1: desc_ptr=0x803B278 desc_num=1
16041/1: rbuf=0x803B260 rsize=256
16041/1: close(3) = 0
.
.
.
16041/1: ioctl(3, _IORN(0x0, 18, 0), 0x0803B400) = 0
read 0 bytes
16041/1: ioctl(3, _IORN(0x0, 18, 0), 0x0803B400) Err#149 EALREADY
read 0 bytes
16041/1: close(3) = 0
16041/1@1: <- libstmf:stmfLoadConfig() = 32782
Error code 32782 is not seen in the source code, however I’ve been looking at Solaris and Illumos based systems for too many years and I know that we tend to define most of these as hexadecimal numbers, whereas truss is giving us the decimal conversion. So just what is that in hex?
$ bc bc 1.06 Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc. This is free software with ABSOLUTELY NO WARRANTY. For details type `warranty'. obase=16 32782 800E
And this number is definitely found in the source code, here:
64#define STMF_STATUS_ERROR 0x8000
78#define STMF_ERROR_VE_CONFLICT (STMF_STATUS_ERROR | 0x0e)
For those of you who are eagle eyed, you’ll have noticed this is the same error we saw during the view restore section in the cluster logs.
Where is the STMF configuration held anyhow?
bash# svccfg export -a svc:/system/stmf > /var/tmp/stmf.export.xml
<property_group name='view_entry-6-600144F0DEA04F0000005501BF0E0085' type='application'> <propval name='all_hosts' type='boolean' value='false'/> <propval name='all_targets' type='boolean' value='false'/> <propval name='host_group' type='ustring' value='artists'/> <propval name='lu_nbr' type='opaque' value='008a000000000000'/> <propval name='target_group' type='ustring' value='FC_Users'/> </property_group>
# Walk through the XML data, searching for the property group view_entry lines only.
# When we find an entry, extract the lu and view number, then move on to the HG and TG
# lines in the data.
#
# As we’re using an associative array we can use the LU, TG and HG (minus the view number)
# as a unique index into the array, storing the view number as the value.
# If we used the view number as part of the searchable key, nothing would ever clash.
#
# Once we’ve constructed the viewline, push that string as a key into the array with the
# view number as the value.
#
# Finally all we now need to do is check if the value exists in the array – if not, push
# the value, otherwise we’ve found a clash.
#
bash$ ./stmf-xcheck.pl /var/tmp/stmf.export.xml
Searching for duplicate views in manifest If any duplicates are found, they will appear as - LU : HG : TG 600144F0DEA04F000000544E94060027:artists:FC_Users - view entry 8 clashes with entry 7 600144F0DEA04F00000054628EF9004C:renderfarm:FC_Users - view entry 8 clashes with entry 0
bash# svccfg -s stmf svc:/system/stmf> setprop lu-600144F0DEA04F000000544E94060027/ve_cnt = 8 svc:/system/stmf> delprop lu-600144F0DEA04F000000544E94060027/view_entry-8-600144F0DEA04F000000544E94060027 svc:/system/stmf> delpg view_entry-8-600144F0DEA04F000000544E94060027
svc:/system/stmf> setprop lu-600144F0DEA04F00000054628EF9004C/ve_cnt = 7
svc:/system/stmf> delprop lu-600144F0DEA04F00000054628EF9004C/view_entry-8-600144F0DEA04F00000054628EF9004C
svc:/system/stmf> delpg view_entry-8-600144F0DEA04F00000054628EF9004C svc:/system/stmf> quit
bash# svcadm clear stmf bash# svcs stmf STATE STIME FMRI online 8:27:46 svc:/system/stmf:default
bash# stmfadm list-state Operational Status: online Config Status : initialized ALUA Status : enabled ALUA Node : 0
Oh I should point out that we needed to restore the STMF start method back to it’s default, non-augmented state:
bash# svccfg -s svc:/system/stmf svc:/system/stmf> setprop start/exec = "/lib/svc/method/svc-stmf start" svc:/system/stmf> quit bash# svcadm refresh svc:/system/stmf:default
Glossary
iSCSI is an acronym for Internet SCSI (Small Computer System Interface), an Internet Protocol (IP)-based storage networking standard for linking data storage subsystems.
By carrying SCSI commands over IP networks, the iSCSI protocol enables you to access block devices from across the network as if they were connected to the local system. COMSTAR provides an easier way to manage these iSCSI target devices.
COMSTAR utilizes a SCSI Target Mode Framework (STMF) to manage target storage devices.