In 2013, I built a RAID5 array with 8 drives in my file server. Unfortunately, due to a configuration error on my part, I was never notified that, on December 4th, 2016, one of my drives had suffered read errors and was subsequently removed from the array. Even more unfortunately, on December 31, 2016, the degraded array suffered another read error. This left me with two failing drives and a busted RAID5 array. Happy New Year!
2016 has now claimed (what I hope is) its final victim: My RAID array. Should've used RAID6 instead of RAID5... :(
— Alexander Taylor (@fuzyll) December 31, 2016
Here's how the RAID array looked after removing the /etc/fstab entry and rebooting:
root@coruscant:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdg[2](S) sdf[4](S) sdi[5](S) sdh[8](S) sda[0](S) sdd[3](S) sdb[6](S) sdc[1](S)
23441084096 blocks super 1.2
unused devices: <none>
root@coruscant:~# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 8
Persistence : Superblock is persistent
State : inactive
Name : coruscant:0 (local to host coruscant)
UUID : 0866e5be:31de914d:65ee18a1:036badca
Events : 35399
Number Major Minor RaidDevice
- 8 0 - /dev/sda
- 8 16 - /dev/sdb
- 8 32 - /dev/sdc
- 8 48 - /dev/sdd
- 8 80 - /dev/sdf
- 8 96 - /dev/sdg
- 8 112 - /dev/sdh
- 8 128 - /dev/sdi
Thanks to smartctl -t short /dev/sd[abcdfghi] and for disk in /dev/sd[abcdfghi]; do echo "#### $disk ####"; smartctl -l selftest $disk; done, I determined that the drives currently named /dev/sdh and /dev/sdi were the failing drives. In the RAID array details above, you can see that the current number of events in the array is 35399. Here's the mdadm --examine output for the two failing drives:
root@coruscant:~# mdadm --examine /dev/sdh
/dev/sdh:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 0866e5be:31de914d:65ee18a1:036badca
Name : coruscant:0 (local to host coruscant)
Creation Time : Sun Apr 14 18:27:58 2013
Raid Level : raid5
Raid Devices : 8
Avail Dev Size : 5860271024 (2794.40 GiB 3000.46 GB)
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=944 sectors
State : clean
Device UUID : 3b9f1bfc:289580bd:a2944f03:bfe31242
Update Time : Sun Dec 4 09:36:43 2016
Checksum : 6882990e - correct
Events : 21191
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 7
Array State : AAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
root@coruscant:~# mdadm --examine /dev/sdi
/dev/sdi:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 0866e5be:31de914d:65ee18a1:036badca
Name : coruscant:0 (local to host coruscant)
Creation Time : Sun Apr 14 18:27:58 2013
Raid Level : raid5
Raid Devices : 8
Avail Dev Size : 5860271024 (2794.40 GiB 3000.46 GB)
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=944 sectors
State : active
Device UUID : 1ccdb1b2:2734474a:ae4580d2:059c1db9
Update Time : Fri Dec 30 23:25:57 2016
Checksum : f23f1009 - correct
Events : 35394
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 5
Array State : AAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
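For a quicker side-by-side comparison, a loop in the same spirit as the smartctl one above can pull just the relevant fields out of each member's superblock. This is a minimal sketch rather than a transcript from my terminal; adjust the device list to match your own drives:
for disk in /dev/sd[abcdfghi]; do
    echo "#### $disk ####"
    mdadm --examine "$disk" | grep -E 'Update Time|Events|Device Role'
done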
The first drive, /dev/sdh, was clearly a lost cause at this point: It's over 9,000 events behind! The second drive, /dev/sdi, was only 5 events (and 5 minutes) behind. This is the drive I focused on rescuing.
After ordering a brand new hard drive of the same type (3TB Western Digital Red), I used ddrescue to copy the entire contents of /dev/sdi to the replacement drive (now /dev/sdh after swapping the other bad drive out for the new one):
root@coruscant:~# ddrescue -f /dev/sdi /dev/sdh rescue.log
rescued: 3000 GB, errsize: 37376 B, current rate: 512 B/s
ipos: 2541 GB, errors: 16, average rate: 51464 kB/s
opos: 2541 GB, run time: 16.19 h, successful read: 0 s ago
Finished
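One note if you're doing the same: with the log file in place, ddrescue can be re-run to retry just the areas it failed to read the first time. Something along these lines (a hypothetical follow-up invocation, not one from the session above):
ddrescue -f -r3 /dev/sdi /dev/sdh rescue.log    # -r3 retries the remaining bad areas up to 3 more times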
The output from mdadm --examine on this new drive is identical to that of the failing drive it replaced. According to a number of sources on the internet, plus the mdadm manpage, I should've been able to use the --force flag to re-assemble the array despite the slight discrepancy in events. Presumably, with only 5 missing events, I'd have just a few corrupt files (if any) as a result.
Unfortunately, re-assembling the array refused to work:
root@coruscant:~# mdadm -v --assemble --force --run --scan
mdadm: looking for devices for /dev/md/0
mdadm: no RAID superblock on /dev/sde3
mdadm: no RAID superblock on /dev/sde2
mdadm: no RAID superblock on /dev/sde1
mdadm: no RAID superblock on /dev/sde
mdadm: /dev/sdg is identified as a member of /dev/md/0, slot 2.
mdadm: /dev/sdh is identified as a member of /dev/md/0, slot 5.
mdadm: /dev/sdf is identified as a member of /dev/md/0, slot 4.
mdadm: /dev/sdd is identified as a member of /dev/md/0, slot 3.
mdadm: /dev/sdc is identified as a member of /dev/md/0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md/0, slot 6.
mdadm: /dev/sda is identified as a member of /dev/md/0, slot 0.
mdadm: added /dev/sdc to /dev/md/0 as 1
mdadm: added /dev/sdg to /dev/md/0 as 2
mdadm: added /dev/sdd to /dev/md/0 as 3
mdadm: added /dev/sdf to /dev/md/0 as 4
mdadm: added /dev/sdh to /dev/md/0 as 5 (possibly out of date)
mdadm: added /dev/sdb to /dev/md/0 as 6
mdadm: added /dev/sda to /dev/md/0 as 0
mdadm: failed to RUN_ARRAY /dev/md/0: Input/output error
mdadm: Not enough devices to start the array.
root@coruscant:~# dmesg | tail -n 41
[ 177.348320] md: bind<sdc>
[ 177.348541] md: bind<sdg>
[ 177.348736] md: bind<sdd>
[ 177.349010] md: bind<sdf>
[ 177.349200] md: bind<sdh>
[ 177.349349] md: bind<sdb>
[ 177.349491] md: bind<sda>
[ 177.349534] md: kicking non-fresh sdh from array!
[ 177.349542] md: unbind<sdh>
[ 177.370961] md: export_rdev(sdh)
[ 177.372364] md/raid:md0: device sda operational as raid disk 0
[ 177.372369] md/raid:md0: device sdb operational as raid disk 6
[ 177.372371] md/raid:md0: device sdf operational as raid disk 4
[ 177.372373] md/raid:md0: device sdd operational as raid disk 3
[ 177.372374] md/raid:md0: device sdg operational as raid disk 2
[ 177.372376] md/raid:md0: device sdc operational as raid disk 1
[ 177.373063] md/raid:md0: allocated 8606kB
[ 177.373150] md/raid:md0: not enough operational devices (2/8 failed)
[ 177.373214] RAID conf printout:
[ 177.373219] --- level:5 rd:8 wd:6
[ 177.373221] disk 0, o:1, dev:sda
[ 177.373223] disk 1, o:1, dev:sdc
[ 177.373225] disk 2, o:1, dev:sdg
[ 177.373226] disk 3, o:1, dev:sdd
[ 177.373228] disk 4, o:1, dev:sdf
[ 177.373230] disk 6, o:1, dev:sdb
[ 177.373753] md/raid:md0: failed to run raid set.
[ 177.373784] md: pers->run() failed ...
[ 177.373840] md: md0 stopped.
[ 177.373845] md: unbind<sda>
[ 177.383129] md: export_rdev(sda)
[ 177.383145] md: unbind<sdb>
[ 177.407022] md: export_rdev(sdb)
[ 177.407039] md: unbind<sdf>
[ 177.419010] md: export_rdev(sdf)
[ 177.419029] md: unbind<sdd>
[ 177.431006] md: export_rdev(sdd)
[ 177.431023] md: unbind<sdg>
[ 177.443006] md: export_rdev(sdg)
[ 177.443022] md: unbind<sdc>
[ 177.455007] md: export_rdev(sdc)
So, I backed up the superblocks from each drive...
root@coruscant:~# mdadm --misc --dump=drives /dev/sd[abcdfgh]
/dev/sda saved as drives/sda.
/dev/sda also saved as drives/wwn-0x50014ee60339f23a.
/dev/sda also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2847602.
/dev/sdb saved as drives/sdb.
/dev/sdb also saved as drives/wwn-0x50014ee6adc4ee09.
/dev/sdb also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1998491.
/dev/sdc saved as drives/sdc.
/dev/sdc also saved as drives/wwn-0x50014ee60326b419.
/dev/sdc also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2767772.
/dev/sdd saved as drives/sdd.
/dev/sdd also saved as drives/wwn-0x50014ee6add8673f.
/dev/sdd also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2617790.
/dev/sdf saved as drives/sdf.
/dev/sdf also saved as drives/wwn-0x50014ee60339f3a4.
/dev/sdf also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T3358756.
/dev/sdg saved as drives/sdg.
/dev/sdg also saved as drives/wwn-0x50014ee003763c8d.
/dev/sdg also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1095358.
/dev/sdh saved as drives/sdh.
/dev/sdh also saved as drives/wwn-0x50014ee20db25404.
/dev/sdh also saved as drives/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4HYL0J2.
...and I took the only option I felt I had left: Re-creating the RAID5 array entirely.
In hindsight, I'm not sure I needed to do this. I'm pretty sure I could have edited the utime and events fields in the dumped superblock above to match the other drives (according to how they're laid out on this page), used mdadm --misc --restore to put it back, attempted --assemble, edited the superblock again with the correct sb_csum, restored again, and then assembled successfully. But I figured I'd try the supported destructive solution before the unsupported one...
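If you're curious what that would have involved: the files written by --dump are sparse copies of each drive that contain only the metadata, so the raw superblock fields can be inspected directly. Here's a rough sketch of the inspection step, assuming the v1.2 superblock sits at the 4096-byte Super Offset shown in --examine and keeps its 64-bit little-endian events counter at offset 200 within the struct (verify those offsets against the layout documentation before editing anything):
# Peek at the stale drive's events counter inside the dumped superblock
# (offsets here are assumptions, per the note above)
xxd -s $((4096 + 200)) -l 8 drives/sdh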
It's extremely important for this part that the array is re-created with the exact same settings as the original. The output from the failed --assemble above told me the correct order of the drives, and the output from --examine above told me the correct chunk size and everything else. So, I specified all of the correct parameters in the correct order to mdadm --create (making sure to mark the drive I'd removed entirely as missing in the slot for the old /dev/sdh):
NOTE: If you are following this because you're trying to resurrect your own dead RAID array, this is not meant to be an all-inclusive guide. I am not responsible for what the following will do to your data if you screw this up. You have been warned (and should probably be posting on a forum or asking someone in IRC at this point).
root@coruscant:~# mdadm -v --create /dev/md0 --chunk=512 --level=5 --raid-devices=8 /dev/sda /dev/sdc /dev/sdg /dev/sdd /dev/sdf /dev/sdh /dev/sdb missing --assume-clean
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdc appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdg appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdd appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdf appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdh appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdb appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: size set to 2930135040K
mdadm: automatically enabling write-intent bitmap on large array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@coruscant:~# fsck /dev/md0
fsck from util-linux 2.27.1
e2fsck 1.42.13 (17-May-2015)
/dev/md0: clean, 266393/320485376 files, 1231997624/5127736320 blocks
root@coruscant:~# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Jan 6 21:05:42 2017
Raid Level : raid5
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
Raid Devices : 8
Total Devices : 7
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Jan 6 21:06:43 2017
State : clean, degraded
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : coruscant:0 (local to host coruscant)
UUID : 7b342406:4f8c41c2:d66d2452:03f3c67c
Events : 8
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 32 1 active sync /dev/sdc
2 8 96 2 active sync /dev/sdg
3 8 48 3 active sync /dev/sdd
4 8 80 4 active sync /dev/sdf
5 8 112 5 active sync /dev/sdh
6 8 16 6 active sync /dev/sdb
14 0 0 14 removed
root@coruscant:~# mount /dev/md0 /store
Success! It worked! Still not sure why using --force with --assemble didn't. I also have no idea why mdadm thinks there's a removed drive in slot 14 rather than 7. Whatever. All my data appears to be here and the array is clean (even if it's still degraded).
The next thing I did was pop in a new drive to replace the other failing one I'd removed. This drive became the new /dev/sdi. I added it to the array and it began rebuilding automatically:
root@coruscant:~# mdadm --manage /dev/md0 --add /dev/sdi
mdadm: added /dev/sdi
root@coruscant:~# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Jan 6 21:05:42 2017
Raid Level : raid5
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Jan 6 21:17:16 2017
State : clean, degraded, recovering
Active Devices : 7
Working Devices : 8
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Rebuild Status : 0% complete
Name : coruscant:0 (local to host coruscant)
UUID : 7b342406:4f8c41c2:d66d2452:03f3c67c
Events : 11
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 32 1 active sync /dev/sdc
2 8 96 2 active sync /dev/sdg
3 8 48 3 active sync /dev/sdd
4 8 80 4 active sync /dev/sdf
5 8 112 5 active sync /dev/sdh
6 8 16 6 active sync /dev/sdb
8 8 128 7 spare rebuilding /dev/sdi
root@coruscant:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdi[8] sdb[6] sdh[5] sdf[4] sdd[3] sdg[2] sdc[1] sda[0]
20510945280 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/7] [UUUUUUU_]
[>....................] recovery = 0.0% (1911964/2930135040) finish=331.8min speed=147074K/sec
bitmap: 8/22 pages [32KB], 65536KB chunk
unused devices: <none>
I ran watch cat /proc/mdstat to track its progress. When it was done, I ran mdadm --detail --scan and used the output to replace the old line in /etc/mdadm/mdadm.conf. This was necessary since I now have an entirely new RAID array (you can see the creation time and UUID are different in the output above). I also uncommented the entry in /etc/fstab (according to blkid, the UUID for the filesystem hadn't changed) and did a reboot just to be sure everything would come back up correctly.
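For reference, the config refresh amounts to something like this (a sketch, not a transcript from my session; double-check the generated ARRAY line before trusting it):
# Remove the stale ARRAY line for the old UUID from /etc/mdadm/mdadm.conf first,
# then append the definition for the newly-created array:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
On a Debian-based system it's also worth regenerating the initramfs afterwards (update-initramfs -u) so the copy of mdadm.conf embedded there matches the one on disk.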
In the end, this story had a happy ending. I lost access to all my files for about a week and had to spend money I was supposed to be using for the upcoming Nintendo Switch, but I now have the experience of having recovered a RAID array! I've also now got both mdadmd and smartd configured properly with e-mail notifications, so hopefully I won't have to deal with this again.
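For anyone who wants the same safety net, the relevant configuration is roughly the following (illustrative snippets only; the address and test schedule are placeholders, not my actual settings):
# /etc/mdadm/mdadm.conf -- mdadm's monitor mode mails this address on events like Fail and DegradedArray
MAILADDR admin@example.com
# /etc/smartd.conf -- monitor everything, enable offline data collection, schedule a nightly
# short self-test at 02:00, and send mail when anything looks unhealthy
DEVICESCAN -a -o on -S on -s (S/../.././02) -m admin@example.com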