1. Introduction

These notes are intended to cover, albeit tersely, the major issues for RAID operations with FSL10 (see the FSL10 Installation document). The disk layout has changed significantly since FSL9, which required updating the scripts that are used. In addition, the scripts have been extensively revised to provide more protection from possible errors in how they are used.

All operations and scripts in this document require root privileges unless otherwise indicated.

2. Guidelines for RAID operations

The FSL10 RAID configuration normally uses two disks configured according to the FSL10 installation instructions (see the FSL10 Installation document). Below are mandatory and recommended guidelines.

2.1. Mandatory practices

These practices are fundamental to the operation of the RAID.

  1. Never mix disks from different computers in one computer.

  2. Never split up a RAID pair unless it is already synced; check with mdstat

A RAID pair (kept together and in order) can be removed or moved between computers if need be. A disk rotation (or initializing a new disk) is probably the only reason to split a pair.

Note
When booting a disk from a RAID by itself, you may see about 20 volume group not found error messages, after which the machine will boot. These error messages only appear like this the first time a disk from the RAID is booted without its partner.

2.2. Recommended practices

These recommendations are intended to provide consistent procedures and make it easier to understand any issues, if they occur.

  1. Always use the lowest numbered interface as the primary and the next lowest numbered as the secondary.

  2. Make the upper (or left) slot the primary, the lower (or right) slot the secondary. If necessary, change the internal cabling to make it so.

  3. Label the slots as primary and secondary as appropriate.

  4. Always boot for a refresh/blank with the primary slot turned on and the secondary slot turned off, so it is clear which is the active disk.

  5. Label the disks (so visible when in use) with the system name and number them 1, 2, 3, …

  6. Label the disks (so visible when in use) with their serial numbers, either from mdstat when only one disk is inserted or by examining the disk (a standard-tool alternative is sketched after this list).

  7. For reference, place the disk serial numbers in a file with their corresponding numbers, e.g.:

    /root/DISKS.txt
    1=ZC1B1YCC
    2=ZC1A6WZ1
    3=ZC1AHENM
  8. When rotating disks, keep the disks in cyclical order (primary, secondary, shelf): 1, 2, 3; then 2, 3, 1; then 3, 1, 2; then 1, 2, 3; and so on.

  9. Rotate disks for a given computer at least once a month, and before any updates

  10. If you have a spare computer (and/or additional systems), keep the disks in the same sequence on all the computers

  11. Do not turn a disk off while the system is running. The only time a key switch state should be changed while the system is running is to add a disk for a blank or refresh operation.

  12. Set your BIOS to allow hot swapping of disks for both the primary and secondary controllers. This is necessary to use the RAID procedures described in this document.
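For item 6 above, if the mdstat script is not handy, the model and serial numbers of the attached disks can also be listed with the standard lsblk tool. This is a generic alternative, not necessarily what mdstat itself runs:

    lsblk -d -o NAME,MODEL,SERIAL,SIZE   # one line per whole disk, no partitions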

3. Disk Rotation

This section describes the disk rotation procedure. It is used to make periodic updates of the shelf disk.

Note
Your BIOS must be set to allow hot swapping of disks, particularly for the secondary controller (it should also be set for the primary controller).
Tip
If you do not have access to the root account, you may have sudo access to the privileged commands. If so, you will need to run the shutdown -h now and refresh_secondary commands with sudo. This is true for other privileged commands used elsewhere in this document.
  1. Wait until the RAID is not recovering; check with mdstat

  2. Shut down the system, e.g., shutdown -h now

  3. Take the disk from the primary slot, put it on the shelf

    We recommend that you label the disk immediately, including the date (and possibly the time). Labeling the disk before it is put away also reduces the chance that it is confused with the old shelf disk.

  4. Move the disk from the secondary slot to the primary slot, keyed-on

  5. Move the old shelf disk to the secondary slot, keyed-off

  6. Boot (primary keyed-on, secondary keyed-off)

  7. Run refresh_secondary

  8. Key-on the secondary slot when prompted

  9. If the script rejects the disk (and stops with an error), seek expert advice. Be sure to note any messages so they can be reported.

  10. If the disk is accepted, let the refresh run to completion (the progress report can be aborted as indicated). The system can be used in the meantime, but may be a little slow.
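For reference, the software side of this rotation reduces to a short command sequence. A condensed sketch of the steps above (the disk moves and key switch changes are done by hand, and all commands require root or sudo):

    mdstat             # step 1: confirm the RAID is not recovering
    shutdown -h now    # step 2: then move the disks by hand (steps 3-5)
    # step 6: boot with the primary keyed-on and the secondary keyed-off
    refresh_secondary  # step 7: key-on the secondary slot when prompted (step 8)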

4. Recoverable testing

This section describes a method for testing updates in a way that provides a relatively easy recovery option if a problem occurs. Should that recovery fail for some reason, it is still possible to recover with the shelf disk as described in the Recover from a shelf disk section below.

Seek expert advice for this, but the basic plan is given below:

Note
Your BIOS must be set to allow hot swapping of disks for both the primary and secondary controllers.
  1. If a rotation hasn’t just been completed, perform one (as an extra backup)

  2. Before proceeding, verify that there is no recovery in progress; check with mdstat

  3. Shut down the system, e.g., shutdown -h now

  4. Key-off the primary slot

  5. Reboot (primary keyed-off, secondary keyed-on)

  6. Install and test the update

    The update and testing will occur on the secondary disk only.

Tip

If an update is relatively minor or the envisaged testing is intended to be of short duration and success is likely, expert users may wish to make use of the drop_primary script to split the RAID pairing in place of the reboot cycle method described above. Note that some (hopefully minor) data loss is possible on the primary (backup) disk as it is removed from the RAID whilst all the filesystems are still mounted read/write. Hence this script should only be used on an unloaded or single-user system. The advantage of using this script is that returning the system to normal operation after a successful update requires only the use of recover_raid; no reboot is required at all.

Warning
Do NOT use the drop_primary script for kernel updates or any other such testing that could affect grub and/or require you to reboot in order to evaluate the success thereof.
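For the drop_primary alternative described in the tip above, the command sequence is minimal. A sketch for expert use only, assuming an unloaded or single-user system and an update that does not affect the kernel or grub:

    drop_primary     # split the pair; sda is kept as the safety backup
    # install and test the update; changes land only on the remaining RAID member (sdb)
    recover_raid     # if the update succeeded, add sda back and let it re-sync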

If the update is deemed successful:

  1. Key-on the primary slot

  2. Run recover_raid to add the primary slot disk back into the RAID.

    The recover_raid script will fail if the disk hasn’t spun up and been recognized by the kernel. It is perfectly fine to try several times until it succeeds.

  3. Once the recovery completes (this may only take a few minutes), reboot the system.

    This step is necessary so that the disk in the primary slot is sda again.

  4. Once the system has booted, it has been successfully updated.
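A condensed sketch of the commands in the successful-update path above (the key-on of the primary slot in step 1 is done by hand):

    recover_raid    # step 2: rerun if the newly keyed-on disk is not yet recognized
    mdstat          # confirm the recovery has completed (step 3)
    reboot          # step 3: returns the disk in the primary slot to sda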

Alternatively, if the update is deemed to have failed, the system can be recovered as follows:

  1. Shut down the system, e.g., shutdown -h now

  2. Key-off the secondary slot

  3. Key-on the primary slot

  4. Reboot (primary keyed-on, secondary keyed-off)

  5. Run blank_secondary

  6. Key-on the secondary slot when prompted

  7. Answer y to blank

  8. Run refresh_secondary

  9. Once the refresh is complete (this may take several hours), you have recovered to the original state.

    The system can be used for operations while the refresh is in progress.

5. Recover from a shelf disk

This section describes how to recover from a good shelf disk. This might be needed, e.g., if it is discovered that a problem has developed on the RAID pair since the last disk rotation, perhaps due to a bad update of some type or some other problem.

Tip
Before using this procedure, it should be considered whether the damage is extensive enough to require starting over from the shelf disk or whether it can be reasonably repaired in place.
Important
This will only produce a good result if the shelf disk is a good copy.
Warning
Do not use this procedure if a problem with the computer caused the damage to the RAID.
Note
Your BIOS must be set to allow hot swapping of disks, particularly for the secondary controller (it should also be set for the primary controller).
  1. Shut down the system, e.g., shutdown -h now

  2. Take the disks from both the primary and secondary slots, set them aside.

  3. Insert the good shelf disk in the primary slot, keyed-on.

  4. Insert the disk that is next in cyclic order (from the ones set aside) in the secondary slot, keyed-off.

  5. Reboot (primary keyed-on, secondary keyed-off)

  6. Run blank_secondary

  7. Key-on the secondary slot when prompted

  8. Answer y to blank

  9. Run refresh_secondary

    Once the refresh has entered the recovery phase (the progress display is being shown onscreen), the system can be used for operations, if need be. In that case, the rest of this procedure can be completed when time allows.

  10. Wait until the RAID is not recovering; check with mdstat

  11. Shut down the system, e.g., shutdown -h now

  12. Take the disk from the primary slot, put it on the shelf

  13. Move the disk from the secondary slot to the primary slot, keyed-on

  14. Insert the remaining disk, that was set aside, in the secondary slot, keyed-off.

  15. Reboot (primary keyed-on, secondary keyed-off)

  16. Run blank_secondary

  17. Key-on the secondary slot when prompted

  18. Answer y to blank

  19. Run refresh_secondary

    Once the refresh has entered the recovery phase (the progress display is being shown onscreen), the system can be used for operations, if need be.

  20. When the refresh is complete, you have recovered to the state of the previous good shelf disk.

6. Initialize a new disk

If one or more of the disks in the set for the RAID fails, you can initialize new ones to replace them.

Important
The new disks should be at least as large as the smallest of the remaining disks.

The sub-sections below cover various scenarios for initializing one new disk to complete a set of three, i.e., one of three disks in a set has failed. It is assumed that you want to maintain the cyclic numbering of the disks for rotations (but that is not required). It should be straightforward to adapt them to other cases.

If you need to initialize more than one disk, please follow the instructions in the Setup additional disk section of the FSL10 Installation document.

6.1. Currently two disks are running in the RAID

This case corresponds to not having a good shelf disk.

  1. Wait until the RAID is not recovering; check with mdstat

  2. Shut down the system, e.g., shutdown -h now

If the disks are in cyclical order (i.e., primary, secondary are numbered in order: 1, 2, or 2, 3, or 3, 1), you should:

  1. Take the disk from the primary slot, put it on the shelf

  2. Move the disk from the secondary slot to the primary slot, keyed-on

If the disks are not in cyclical order (i.e., primary, secondary are numbered in order: 1, 3, or 2, 1, or 3, 2), you should:

  1. Take the disk from the secondary slot, put it on the shelf

In either case, finish with:

  1. Put the new disk in the secondary slot, keyed-off.

  2. Boot (primary keyed-on, secondary keyed-off)

  3. Run blank_secondary

  4. Key-on the secondary slot when prompted

  5. Answer y to blank

  6. Run refresh_secondary

  7. Once the refresh is complete, the disk can be used normally

  8. Label the new disk with its system name, number, and serial number.

6.2. Currently one disk is running in the RAID, but two are installed

In this case, there is a good shelf disk. The strategy used avoids overwriting it until there are three functional disks again.

  1. Use mdstat to determine which disk is running; compare its serial number to those shown on the labels or inspect the disks to determine their serial numbers.

  2. Shut down the system, e.g., shutdown -h now

  3. Remove the non-working disk.

  4. Move the working disk to the primary slot, if it isn’t already there, keyed-on.

  5. Put the new disk in the secondary slot, keyed-off.

  6. Boot (primary keyed-on, secondary keyed-off)

  7. Run blank_secondary

  8. Key-on the secondary slot when prompted

  9. Answer y to blank

  10. Run refresh_secondary

  11. Once the refresh is complete, the disk can be used normally

  12. Label the new disk with its system name, number, and serial number.

If the disks are not in cyclical order (i.e., primary, secondary are numbered in order: 1, 3, or 2, 1, or 3, 2), on the next disk rotation you should move the secondary disk to the shelf instead of moving the primary.

6.3. Currently one disk is installed and running

In this case, the shelf disk is assumed to be healthy, but older. Again, the strategy is to avoid overwriting it until there is a full complement of disks available.

If the working disk is not in the primary slot:

  1. Shut down the system, e.g., shutdown -h now

  2. Move the working disk to the primary slot, keyed-on.

  3. Boot (primary keyed-on, secondary empty)

Then in any event:

  1. Put the new disk in the secondary slot, keyed-off.

  2. Run blank_secondary

  3. Key-on the secondary slot when prompted

  4. Answer y to blank

  5. Run refresh_secondary

  6. Once the refresh is complete, the disk can be used normally

  7. Label the new disk with its system name, number, and serial number.

If the disks are not in cyclical order (i.e., primary, secondary are numbered in order: 1, 3, or 2, 1, or 3, 2), on the next disk rotation you should move the secondary disk to the shelf instead of the primary.

7. Script descriptions

This section describes the various scripts that are used for RAID maintenance.

7.1. mdstat

This script can be used by any user (not just root) to check the status of the RAID. It is most useful for checking whether a recovery is in progress or has ended, but is also useful for showing the current state of the RAID, including any anomalies.

The script also lists various useful details for all block devices (such as disks) that are currently connected, including their model and serial numbers where applicable.
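mdstat is an FSL10-specific script; if it is ever unavailable, essentially the same information can be gathered with standard tools. A generic sketch (these are not necessarily the commands the script itself runs, and mdadm --detail needs root):

    cat /proc/mdstat                    # RAID state and any recovery progress
    mdadm --detail /dev/md0             # per-member status of the RAID
    lsblk -d -o NAME,MODEL,SERIAL,SIZE  # model and serial numbers of attached disks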

7.2. refresh_secondary

This can be used to refresh a shelf disk for the RAID as a new secondary disk (sdb) as part of a standard three (or more) disk rotation.

Initially, the script performs some sanity checks to confirm that the RAID /dev/md0:

  1. Exists.

  2. Is not in a clean state, i.e., it needs recovery.

  3. Is not already recovering, i.e., is in a recoverable state.

Additional checks are performed to confirm that the content the script intends to copy is where it expects it to be and has the right form. Any primary disk (sda) will be rejected that:

  1. Is not part of the RAID (md0)

  2. Has a boot scheme other than the BIOS or UEFI set up as described in the FSL10 Installation Document.

To ensure that only an old shelf disk for this system is overwritten, any secondary disk (sdb) will be rejected that:

  1. Was loaded (slot keyed-on) before starting the script

    Unless overridden by -A or previously loaded by this or the blank_secondary script.

  2. Is already part of RAID md0

    Which should only happen if run incorrectly with -A (or other interfering commands have been executed) or the disk has fallen out of the RAID due to failure.

  3. Has a RAID from a different computer, i.e., foreign

    Technically this could also be another RAID from the same computer, but that would not be a properly set up FSL10 computer, which should have only the one RAID.

  4. Has any part already mounted

    Again catching misuse of the -A option.

  5. Has a different boot scheme than the primary

    And hence is probably from a different computer.

  6. Has a different RAID UUID

    This would be a disk from a different computer. Though whether this check can actually trigger after the test for a foreign RAID above remains to be seen.

  7. Was last booted at a future TIME (possibly due to a mis-set clock or clocks)

  8. Has a higher EVENT count, i.e., is newer (but see the WARNING item below)

  9. Has been used (booted) separately by itself

  10. Has a different partition layout from the primary

  11. Is smaller than the size of the RAID on the primary disk.

If any of the checks reject the disk, we recommend you seek expert advice; please record the error so it can be reported.

The checks are included to make the refresh process as safe as possible, particularly at a station with more than one FSLx computer. We believe all the most common errors are trapped, but the script should still be used with care.

Warning
The check on the EVENT counter is intended to prevent accidentally using the shelf disk to overwrite a newer disk from the RAID. This check can be defeated if the primary has run for a considerable period of time before the refresh is attempted, since its event count may eventually exceed that of the newer disk. This should not be an issue if the refresh is attempted promptly after the shelf disk is booted for the first time by itself and the RAID was run on the other disks for more than a trivial amount of time beforehand.
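For reference, several of the quantities that these checks and the warning above refer to (RAID UUID, event count, last update time) can be inspected by hand with mdadm. A sketch, run as root, assuming the RAID member partitions are /dev/sda1 and /dev/sdb1 (the actual partition numbers depend on the FSL10 layout):

    mdadm --examine /dev/sda1 /dev/sdb1 | grep -E 'UUID|Events|Update Time'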

If the disk being refreshed is from the same computer and has just been on the shelf unused since it was last rotated, it is safe to refresh and should be accepted by all the checks. In other words, normal disk rotation should work with no problems.

If the primary and/or secondary disks are removable, the user will be provided with some information about the disks and given an opportunity to continue with Enter or abort with Ctrl+C. Typically, if a USB disk is identified as the primary or secondary, one would not want to continue. However for some machines, the SATA disks that are the primary and/or secondary may be marked removable if they are hot swappable, but would still be appropriate to use.

This script requires the secondary disk (sdb) to not be loaded, i.e., the slot turned off, when the script is started. However, it has an option, -A (use only with expert advice), to “Allow” an already loaded disk to be used. It is intended to make remote operation possible and must be used with extra care.

If the disk is turned on (when prompted) during the script, it will automatically be “Allowed” by both this script and blank_secondary, which also supports this feature. This allows an expert, after a failed refresh_secondary, to run blank_secondary and then rerun refresh_secondary, all without having to shut down, turn the disk off, reboot, start the script, and turn the disk on for each step.

The refresh will take several hours. The script provides a progress indicator that can safely be aborted (using Ctrl+C as described by the on-screen instructions) if that is preferred. An active screen saver may make it difficult to see the progress after a while, but pressing shift or some other key should make it visible again. If you abort the progress indicator, you can check the progress later with mdstat. The system can be used normally while it is refreshing, but it may be a little slow.

Once the progress indicator is updating, it is safe to reboot the computer if it is needed.

7.3. blank_secondary

This script should only be used with expert advice.

It can be used to make any secondary disk (sdb) refreshable, if it is big enough. It must be used with care and only on a secondary disk that you know is safe to erase. Generally speaking, you don’t want to use it with a disk from a different FSLx computer except in very unusual circumstances; see the Recovery scenarios section for some example cases. It will ask you to confirm before blanking.

It will reject any secondary disk (sdb) that:

  1. Was loaded (slot keyed-on) before starting the script

    Unless you have just loaded it through refresh_secondary's auspices or used the -A option to “Allow” it (see below).

  2. Is still part of the RAID md0

    Which should only happen if run incorrectly with -A (or other interfering commands have been executed).

  3. Has any partition already mounted

    Again catching misuse of the -A option.

  4. Has a partition that is in RAID md0

    This is essentially redundant with the “Is still part of the RAID md0” check above, but is included out of an abundance of caution.

  5. Has a partition that is included in any RAID.

If the primary disk is removable, the user will be provided with some information about the disk and given an opportunity to continue with Enter or abort with Ctrl+C. Typically, if a USB disk is identified as the primary, one would not want to continue. However for some machines the SATA disk that is the primary may be marked removable if it is hot swappable, but would still be appropriate to use.

This script requires the secondary disk (sdb) to not be loaded, i.e., the slot turned off, when the script is started. However, it has an option, -A (use only with expert advice), to “Allow” an already loaded disk to be used. It is intended to make remote operation possible and must be used with extra care.

If the disk is turned on (when prompted) during the script, it will automatically be “Allowed” by both this script and refresh_secondary, which also supports this feature. This allows you to then run refresh_secondary immediately without having to shutdown, turn the disk off, reboot, start the script, and turn the disk on.

Note
On the 32-bit i386 platform, due to a broken vgremove binary, this script can give WARNINGs when erasing disks that were used for LVM. These warnings can safely be ignored - the disk will be successfully blanked (despite vgremove having segmentation-faulted instead of performing the requisite action thereby causing pvremove to complain about the VG still being active.)

7.4. drop_primary

This script is only for use with expert advice.

This script can be used to drop a primary disk (sda) out of a RAID pair (by marking it as failed) so that it can act as a safety backup during major upgrades or other significant changes.

Initially, the script performs some sanity checks to confirm that the RAID /dev/md0:

  1. Exists.

  2. Is in a clean state, i.e., both disks are present and no recovery is currently in progress.

  3. Contains the primary disk (sda) as a member.

If the primary disk is removable, the user will be provided with some information about the disk and given an opportunity to continue with Enter or abort with Ctrl+C. Typically, if a USB disk is identified as the primary, one would not want to continue. However for some machines the SATA disk that is the primary may be marked removable if it is hot swappable, but would still be appropriate to use.

Note
This script is non-destructive in nature and its effect can easily be reversed by running the recover_raid script mentioned below.

7.5. recover_raid

This script is only for use with expert advice.

This script can be used to recover a disk (sda or sdb) that has fallen out of the RAID array, becoming inactive. A disk can fall out of the array for several possible reasons, including:

  1. A real disk fault of some sort, including one caused by turning it off whilst it is still in use.

  2. Using the mdadm command with -f option to mark it as faulty.

  3. Turning it off whilst the system is shut down and booting without it.

  4. Using the drop_primary script.

This script is designed to be used only with a set of disks that were most recently used together in an active RAID. It is recommended only to use this script if the key switches for the disks have not been manipulated since the inactive disk fell out of the RAID; in this case it should always be safe.

Note
The inactive disk is either failed or missing. It is failed if it was either marked failed by hand or dropped out of the RAID due to disk errors. It is missing if either the system was rebooted with the disk failed or physically missing or it was manually marked removed. You can check which state an inactive disk is in with mdadm --detail /dev/md0, which lists a failed disk as faulty; a missing disk will not appear at all.
Note
The active disk is the one the system is still running on.
Tip
It is okay to use this script even if the inactive disk fell out the RAID a (long) long time ago (in a galaxy far, far away) and/or there have been extensive changes to the active disk. It is also okay to use if the system was rebooted (even multiple times) or the active disk was used (booted) separately by itself since the inactive disk fell out of the RAID.
Warning
This script must NOT be used if the inactive disk has been changed in any way e.g., by being used (booted) separately (which is caught by the script) or refreshed against some other disk, or if the active disk has been used to refresh any other disk in the interim. In particular, the script must NOT be used to refresh a shelf disk — only use refresh_secondary for that purpose.

It normally works on md0, but a different md device can be specified as the first argument.

It will refuse to recover the RAID if the RAID:

  1. Does not need recovery

  2. Is not in a recoverable state, e.g., is already recovering

or if any missing disk:

  1. Has a later modification TIME than the active disk

  2. Has a higher EVENT count, i.e., is newer, than the active disk

  3. Has been used (booted) separately (as mentioned above in the WARNING item)

or if no matching missing disk can be found.

The recovery may be fairly quick, as short as a few minutes, if the inactive disk is relatively fresh. There is an ongoing progress display that can be terminated early with Ctrl+C, without affecting the recovery. If you abort the progress indicator, you can check the progress with mdstat. The system can be used normally while it is recovering, but it may be a little slow.
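A typical sequence, with the state check from the note above done by hand first; a sketch (recover_raid performs its own checks in any case):

    mdadm --detail /dev/md0   # confirm which disk is inactive (faulty or missing)
    recover_raid              # operates on md0 by default
    mdstat                    # follow the recovery progress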

7.6. raid-events

The mdmonitor service can be configured to use the raid-events script to send email reports on RAID rebuilds and checks. This is most useful for getting reports for the start and end of a RAID build triggered by refresh_secondary. The script will also report on the start and end of any other RAID rebuilds, including those triggered by the recover_raid script. Checks are triggered periodically to verify the integrity of the RAIDs.

The emails are sent to root, typically redirected to oper, and then forwarded to off-system accounts that may have their email read more frequently. There are four different possible subject lines used in the emails:

  • Rebuild Running on device

    Note
    Sometimes for a rebuild started by refresh_secondary, this message may be sent about 20 minutes after the rebuild has started. The cause of this is not entirely understood, but the message is eventually sent.
  • Rebuild Ended state on device

  • Check Running on device

  • Check Ended state on device

where:

  • device is the RAID device, e.g., /dev/md/0

  • state is OKAY if the final state was not degraded, or DEGRADED if it was.

The body of each email is the output of the mdstat script at the time the message was sent.
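The forwarding chain described above is normally arranged with mail aliases. A minimal sketch, assuming a Debian-style /etc/aliases and a placeholder off-system address (run newaliases afterwards if your MTA keeps a compiled alias database):

    # /etc/aliases
    root: oper
    oper: staff@example.org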

7.6.1. Checks

The checking process is triggered by /etc/cron.d/mdadm on the first Sunday of each month. It uses the /usr/share/mdadm/checkarray script and takes a similar amount of time as a rebuild of the RAID triggered by refresh_secondary.

7.6.2. Installing raid-events

To install the script, use the following commands as root:

cd /usr/local/sbin
cp ~/fsl10/RAID/raid-events .
chmod u+x raid-events
cat <<EOF >>/etc/mdadm/mdadm.conf

PROGRAM /usr/local/sbin/raid-events
EOF

And then reboot.
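After the reboot, the installation can be sanity-checked; a sketch (the service name follows the mdmonitor service mentioned above):

    grep PROGRAM /etc/mdadm/mdadm.conf   # should show /usr/local/sbin/raid-events
    systemctl status mdmonitor           # the monitor service should be active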

7.6.3. Disabling checking

If the checking process causes performance problems at inconvenient times, there are at least three options for dealing with it:

  • Disable the AUTOCHECK option in /etc/default/mdadm (see the sketch after this list)

    This is suitable if the RAID is rebuilt monthly using refresh_secondary. In this case, the check is superfluous.

  • Change the time at which it runs as configured in /etc/cron.d/mdadm

  • Cancel a running check, with:

    /usr/share/mdadm/checkarray --cancel --all
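For the first option, the change amounts to one line in /etc/default/mdadm; a sketch (the value shipped by default is true):

    # /etc/default/mdadm
    AUTOCHECK=false   # disables the monthly cron-driven check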

7.7. refresh_spare_usr2

This script is not part of RAID operations per se, but is included in this document for completeness. In a two system configuration (operational and spare), it is used to make a copy of the operational system’s /usr2 partition on the spare system. Normally this partition holds all the operational FS programs and data.

A full description of the features of the script is available from the refresh_spare_usr2 -h output.

Important
This script should be installed on the spare system only.
Tip

A recommended monthly backup strategy is to do a disk rotation on both systems. Once the RAIDs on both systems are recovering, you can log out of both systems and then log in to the spare system again to start refresh_spare_usr2.

While a refresh_spare_usr2 with two nearly synchronized /usr2 partitions is fairly fast, the recovery of the RAIDs may increase the amount of time required by about a factor of three.

Once refresh_spare_usr2 completes, it is safe to reboot, even if a recovery is still ongoing. The only requirement is to reboot the spare system before the FS is run on it again.

A feature of this approach is that it will make the spare system shelf disk a deeper back-up than the spare system RAID disks.

7.7.1. Installing refresh_spare_usr2

Warning
For this script to work most usefully, the operational and spare systems should have the same set-up, in particular the same user accounts with the same UIDs and GIDs, especially for accounts that have home directories on /usr2, as well as other OS set-up information the FS may depend on, such as /etc/hosts and /etc/ntp.conf.
Tip
If you are unwilling or unable to use the forced command approach below for the root account, you may find the approach of using sudo in a regular account a usable alternative. For details on that approach, please see the Installing refresh_spare_usr2 with CIS hardening subsection of the CIS Hardening for FSL10 document.

All the steps below must be performed as root on the specified system. You should read all of each step or sub-step before following it.

  1. On the spare system:

    1. Install refresh_spare_usr2. Execute:

      cd /usr/local/sbin
      cp -a /root/fsl10/RAID/refresh_spare_usr2 refresh_spare_usr2
      chown root.root refresh_spare_usr2
      chmod a+r,u+wx,go-wx refresh_spare_usr2
    2. Customize refresh_spare_usr2, following the directions in the comments in the script:

      1. Comment-out the lines (add leading #s):

        echo "This script must be customized before use.  See script for details."
        exit 1
      2. Change the operational in the line:

        remote_node=operational

        to the alias (preferred), FQDN, or IP address of your operational system.

    3. Create and copy a key for root. Execute:

      ssh-keygen
      ssh-copy-id root@operational

      where operational is the alias, name, or IP of your operational system.

      Note
      If root already has a key, you only need the second command above to copy it to the operational system.
      Tip
      It is recommended to not set a passphrase.
  2. On the operational system:

    1. Install the rrsync script. Execute:

      gunzip -c /usr/share/doc/rsync/scripts/rrsync.gz >/usr/local/bin/rrsync
      ln -s /usr/local/bin/rrsync /usr/bin/rrsync
      chmod u+x,go-x /usr/local/bin/rrsync
    2. Set the root account to only allow a forced command with ssh:

      1. Replace the ssh-rsa at the start of the line (probably the only one), which holds the spare system root account's key, in ~root/.ssh/authorized_keys on the operational system with:

        command="rrsync -ro /usr2" ssh-rsa

        Tip
        If your spare system is registered with DNS, you can provide some additional security by adding from="node"  (note the trailing space) at the start of the line, where node is the FQDN or IP address of the spare system. It may be necessary to provide the FQDN, IP address, and/or alias of the spare system in a comma separated list in place of node to get reliable operation.
      2. Set sshd to only allow forced commands for root by un-commenting the PermitRootLogin line in /etc/ssh/sshd_config and changing the second field from prohibit-password to forced-commands-only.

      3. Restart sshd. Execute:

        systemctl restart sshd

7.7.2. Using refresh_spare_usr2

  1. As part of a monthly backup, you would usually start a disk rotation on both the operational and spare systems first. Once both systems are recovering, you should log out of both systems.

    Important
    Before proceeding, make sure that no one is logged into either system and that no processes are running on /usr2 on either system, particularly the FS.
  2. Log in on the spare system. The best choice for this is as root on a local virtual console text terminal.

    Tip
    Logging in as a non-root user is acceptable. Any means can be used: a text console, ssh from another system (preferably not the operational system), or the graphics X display. In these cases, you must promote to root using su (or execute the script with sudo for CIS hardened systems).
  3. Execute the script:

    refresh_spare_usr2

    Answer the question y if it is safe to proceed.

  4. Log out of the system.

  5. Wait until the script has finished before logging in again and resuming other activities on the systems.

    An email will be sent to root when the script finishes. If your email to root is being forwarded to a mailbox off the system, you can use receipt of that message (and that it shows no errors) as the indication that it finished successfully.

    Alternatively, you can examine the logs (before starting the script) in /root/refresh_spare_usr2_logs to see how long previous script uses took. When at least that much time has elapsed, you can login and can check the log for the current script use to verify that it has finished.

    Caution

    Generally speaking, it is best to not login to either the spare or operational system while the script is running. Under normal circumstances the script should run quickly enough that this does not cause a significant burden. If it is necessary to login to either system, the following paragraphs in this CAUTION cover the relevant considerations.

    If you do login to the spare system, it is best to not use an account with a home directory on the /usr2 partition (logging in as root on a text console is okay) or otherwise access that partition while the script is running. In any event, activity on /usr2 should be minimized.

    It is possible to use the operational system while the script is running if necessary, but this should be avoided if possible and activity on the /usr2 partition should be minimized. You should not expect any changes on the operational system /usr2 that occur after the script starts to be propagated to the spare system. If any files are deleted before they can be transferred, there will be a warning file has vanished: "file", for each such file, and there will be a summary warning that starts with rsync warning: some files vanished before they could be transferred, but without additional warnings or errors, the transfer should otherwise be successful.

    In case you have logged into either system while the script is running, you can touch up the copy on the spare system by rerunning the script after logging out.

  6. If the script finished with no problems, you can reboot the spare system as soon as is convenient. You may reboot even if the RAID is recovering, but you can wait until the recovery is complete. The only requirement is to reboot before the FS is run again on the spare system.

8. Multiple computer set-up

You may have more than one FSL10 computer at a site, either an operational and spare pair for one system and/or additional computers for additional systems. In this case, we recommend that you do a full setup of each computer from scratch from the FSL10 installation notes. The main, but not only, reason for this is to make sure each RAID has a unique UUID, so the refresh_secondary script will be able to help you avoid accidentally mixing disks while doing a refresh. While in principle it is possible to do one set-up, clone the configuration to more disks, and then customize it for each computer, we are not providing detailed instructions on how to do that at this time.

It is recommended that the network configuration on each machine be made independent of the MAC address of the hardware. This will make it possible to move a RAID pair to a different computer and have it work on the network. Please note that the IP address and hostname are tied to the disks and not to the computers. For information on how to configure this, please see the (optional) Network configuration changes section of the FSL10 installation document.

The configuration of the system outside of the /usr2 partition between operational and spare computers should be maintained in parallel so that the same capabilities are available on both. In particular, any packages installed on one should also be installed on the other. In addition, it is important that the user and group IDs of all users on the operational and spare computers be the same. It should not be necessary to maintain parallelism with OS updates, but that is recommended as well. It is recommended to maintain parallelism with other independent operational/spare systems at a site as well (this may enable additional recovery options in extreme cases).

9. Recovery scenarios

The FSL10 setup provides several layers of recovery in case of problems with the computers or the disks. Each system has a shelf disk, which can serve as a back-up. Additionally, if there is a spare computer for each operational computer, there are more recovery options. If there are other FSL10 computers at the site, it may be possible in extreme cases to press those computers and/or disks into service, particularly if they have been maintained in parallel.

A few example recovery scenarios are described below in rough order of likelihood of being needed. None of them are very likely to be needed, particularly those beyond the first two.

Important
In any scenario, if disks and/or a computer have failed, they should be repaired or replaced as soon as feasible.

9.1. Operational computer failure

This might be caused by a power supply or other hardware failure. If the contents of the operational RAID are not damaged, the RAID pair can be moved to the spare computer until the operational computer is repaired. Once the RAID has been moved, whether the contents have been damaged can be assessed. It will be necessary to move connections for any serial/GPIB devices to the spare computer as well.

Tip

If the disks do not connect to network after first booting in a different computer:

  1. Shut the system down.

  2. Remove the power cord.

  3. Press and hold the power button for 15 or more seconds.

    The goal is to drain any residual energy in the computer in order to completely reset the NIC.

  4. Reboot and try again.

This has been seen to solve the problem, perhaps because it forces the NIC to re-register with ARP. Waiting longer may also solve the problem.

9.2. One disk in the operational computer RAID fails

This should not interrupt operations. The computer should continue to run seamlessly on the remaining disk. If the system is rebooted in this state, it should use the working disk. At the first opportunity, usually after operations, the recover_raid script can be tried to restore the disk to the RAID. If that doesn’t work, the disk may have failed and may need to be replaced (it may be worthwhile to try blanking and refreshing it first). If the disk has failed, it should be removed and a disk rotation should be performed (with the still good disk in the primary slot) to refresh the shelf disk and make a working RAID. The failed disk should be repaired or replaced with a new disk that is at least as large. The blank_secondary script should be used to erase the new disk before it is introduced into the rotation sequence. See the Initialize a new disk section above for full details on initializing a new disk.

9.3. Operational computer RAID corrupted

As well as large-scale corruption, this can include recovery from accidental loss of important non-volatile files. This would generally not include .skd, .snp, and .prc files; those can be more easily restored by generating them again. It also can be used to recover from a bad OS patch (which is extremely unlikely). That is easier to manage if the patches were applied just after a disk rotation (see also the Recoverable testing section).

In this case, the shelf disk can be used to restore the system to the state at the time of the most recent rotation. To do this, follow the procedure in the Recover from a shelf disk section above. The system can be used for operations once the RAID is recovering for the first refresh in the procedure. All needed volatile operational files that were created/modified after the last disk rotation will need to be recreated. Then as time allows, the other disk can be recovered by finishing the procedure in the Recover from a shelf disk section.

If the first disk that is tried for blanking and recovery doesn’t work, the other one can be tried. If neither works, it should be possible to run on just what was the shelf disk until a fuller recovery is possible, probably with replacements for the malfunctioning disks.

This approach could also be used for a similar problem with the spare computer and using its shelf disk for recovery.

The approach of this section should not be used if a problem with the operational computer caused the damage to its RAID. In that case, follow the Operational computer RAID corrupted and operational computer failure sub-section below.

9.4. Operational computer RAID corrupted and operational computer failure

This might happen if the operational computer is exposed to fire and/or water. In this case, there are two options. One is switching to using the spare computer as in the Loss of operational computer and all its disks sub-section below. The other is to use the operational computer’s shelf disk in the spare computer, either by itself or by making an ersatz RAID by blanking the spare computer’s shelf disk and refreshing it from the operational computer’s shelf disk.

In the latter scenario, be sure to preserve the original working RAID from the spare computer. All needed volatile operational files that were created/modified after the last operational computer disk rotation will need to be recreated. It will be necessary to move connections for any serial/GPIB devices to the spare computer as well. However, it will not be necessary to enable any daemons like metserver and metclient as it would be in the former scenario; this may be a significant time saver.

9.5. Loss of all operational computer disks

If the RAID and shelf disk on the operational computer are beyond recovery, the RAID pair from the spare computer can be moved to the operational computer. All needed volatile operational files that were created/modified after the last refresh_spare_usr2 will need to be recreated. If daemons like metserver and metclient are needed, they will need to be enabled.

This approach should not be used if a problem with the operational computer caused the damage to its RAID. In that case, follow the Operational computer RAID corrupted and operational computer failure sub-section above.

9.6. Loss of operational computer and all its disks

In this case, operations should be moved to the spare computer until the operational computer is repaired or replaced. It will be necessary to move connections for any serial/GPIB devices to the spare computer as well. If daemons like metserver and metclient are needed, they will need to be enabled. All needed volatile operational files that were created/modified after the last refresh_spare_usr2 will need to be recreated.