In 2013, I built a RAID5 array with 8 drives in my file server. Unfortunately, due to a configuration error on my part, I was never notified that, on December 4th, 2016, one of my drives had suffered read errors and was subsequently removed from the array. Even more unfortunately, on December 31, 2016, the degraded array suffered another read error. This left me with two failing drives and a busted RAID5 array. Happy New Year!
2016 has now claimed (what I hope is) its final victim: My RAID array. Should've used RAID6 instead of RAID5... :(
— Alexander Taylor (@fuzyll) December 31, 2016
Here's how the RAID array looked after removing the /etc/fstab entry and rebooting:
root@coruscant:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdg[2](S) sdf[4](S) sdi[5](S) sdh[8](S) sda[0](S) sdd[3](S) sdb[6](S) sdc[1](S)
23441084096 blocks super 1.2
unused devices: <none>
root@coruscant:~# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 8
Persistence : Superblock is persistent
State : inactive
Name : coruscant:0 (local to host coruscant)
UUID : 0866e5be:31de914d:65ee18a1:036badca
Events : 35399
Number Major Minor RaidDevice
- 8 0 - /dev/sda
- 8 16 - /dev/sdb
- 8 32 - /dev/sdc
- 8 48 - /dev/sdd
- 8 80 - /dev/sdf
- 8 96 - /dev/sdg
- 8 112 - /dev/sdh
- 8 128 - /dev/sdi
Thanks to smartctl -t short /dev/sd[abcdfghi] and for disk in /dev/sd[abcdfghi]; do echo "#### $disk ####"; smartctl -l selftest $disk; done, I determined that the drives currently named /dev/sdh and /dev/sdi were the failing drives. In the RAID array details above, you can see that the current number of events in the array is 35399. Here's the mdadm --examine output for the two failing drives:
root@coruscant:~# mdadm --examine /dev/sdh
/dev/sdh:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 0866e5be:31de914d:65ee18a1:036badca
Name : coruscant:0 (local to host coruscant)
Creation Time : Sun Apr 14 18:27:58 2013
Raid Level : raid5
Raid Devices : 8
Avail Dev Size : 5860271024 (2794.40 GiB 3000.46 GB)
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=944 sectors
State : clean
Device UUID : 3b9f1bfc:289580bd:a2944f03:bfe31242
Update Time : Sun Dec 4 09:36:43 2016
Checksum : 6882990e - correct
Events : 21191
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 7
Array State : AAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)
root@coruscant:~# mdadm --examine /dev/sdi
/dev/sdi:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 0866e5be:31de914d:65ee18a1:036badca
Name : coruscant:0 (local to host coruscant)
Creation Time : Sun Apr 14 18:27:58 2013
Raid Level : raid5
Raid Devices : 8
Avail Dev Size : 5860271024 (2794.40 GiB 3000.46 GB)
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 5860270080 (2794.39 GiB 3000.46 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
Unused Space : before=262064 sectors, after=944 sectors
State : active
Device UUID : 1ccdb1b2:2734474a:ae4580d2:059c1db9
Update Time : Fri Dec 30 23:25:57 2016
Checksum : f23f1009 - correct
Events : 35394
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 5
Array State : AAAAAAA. ('A' == active, '.' == missing, 'R' == replacing)
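For a quicker side-by-side comparison, a loop in the same spirit as the smartctl one above can pull just the relevant fields out of each member's superblock. This is a minimal sketch rather than a transcript from my terminal; adjust the device list to match your own drives:
for disk in /dev/sd[abcdfghi]; do
    echo "#### $disk ####"
    mdadm --examine "$disk" | grep -E 'Update Time|Events|Device Role'
done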
The first drive, /dev/sdh, was clearly a lost cause at this point: It's over 9,000 events behind! The second drive, /dev/sdi, was only 5 events (and 5 minutes) behind. This is the drive I focused on rescuing.
After ordering a brand new hard drive of the same type (3TB Western Digital Red), I used ddrescue to copy the entire contents of /dev/sdi to the replacement drive (now /dev/sdh after swapping the other bad drive out for the new one):
root@coruscant:~# ddrescue -f /dev/sdi /dev/sdh rescue.log
rescued: 3000 GB, errsize: 37376 B, current rate: 512 B/s
ipos: 2541 GB, errors: 16, average rate: 51464 kB/s
opos: 2541 GB, run time: 16.19 h, successful read: 0 s ago
Finished
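One note if you're doing the same: with the log file in place, ddrescue can be re-run to retry just the areas it failed to read the first time. Something along these lines (a hypothetical follow-up invocation, not one from the session above):
ddrescue -f -r3 /dev/sdi /dev/sdh rescue.log    # -r3 retries the remaining bad areas up to 3 more times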
The output from mdadm --examine on this new drive is identical to that of the failing drive it replaced. According to a number of sources on the internet, plus the mdadm manpage, I should've been able to use the --force flag to re-assemble the array despite the slight discrepancy in events. Presumably, with only 5 missing events, I'd have just a few corrupt files (if any) as a result.
Unfortunately, re-assembling the array refused to work:
root@coruscant:~# mdadm -v --assemble --force --run --scan
mdadm: looking for devices for /dev/md/0
mdadm: no RAID superblock on /dev/sde3
mdadm: no RAID superblock on /dev/sde2
mdadm: no RAID superblock on /dev/sde1
mdadm: no RAID superblock on /dev/sde
mdadm: /dev/sdg is identified as a member of /dev/md/0, slot 2.
mdadm: /dev/sdh is identified as a member of /dev/md/0, slot 5.
mdadm: /dev/sdf is identified as a member of /dev/md/0, slot 4.
mdadm: /dev/sdd is identified as a member of /dev/md/0, slot 3.
mdadm: /dev/sdc is identified as a member of /dev/md/0, slot 1.
mdadm: /dev/sdb is identified as a member of /dev/md/0, slot 6.
mdadm: /dev/sda is identified as a member of /dev/md/0, slot 0.
mdadm: added /dev/sdc to /dev/md/0 as 1
mdadm: added /dev/sdg to /dev/md/0 as 2
mdadm: added /dev/sdd to /dev/md/0 as 3
mdadm: added /dev/sdf to /dev/md/0 as 4
mdadm: added /dev/sdh to /dev/md/0 as 5 (possibly out of date)
mdadm: added /dev/sdb to /dev/md/0 as 6
mdadm: added /dev/sda to /dev/md/0 as 0
mdadm: failed to RUN_ARRAY /dev/md/0: Input/output error
mdadm: Not enough devices to start the array.
root@coruscant:~# dmesg | tail -n 41
[ 177.348320] md: bind<sdc>
[ 177.348541] md: bind<sdg>
[ 177.348736] md: bind<sdd>
[ 177.349010] md: bind<sdf>
[ 177.349200] md: bind<sdh>
[ 177.349349] md: bind<sdb>
[ 177.349491] md: bind<sda>
[ 177.349534] md: kicking non-fresh sdh from array!
[ 177.349542] md: unbind<sdh>
[ 177.370961] md: export_rdev(sdh)
[ 177.372364] md/raid:md0: device sda operational as raid disk 0
[ 177.372369] md/raid:md0: device sdb operational as raid disk 6
[ 177.372371] md/raid:md0: device sdf operational as raid disk 4
[ 177.372373] md/raid:md0: device sdd operational as raid disk 3
[ 177.372374] md/raid:md0: device sdg operational as raid disk 2
[ 177.372376] md/raid:md0: device sdc operational as raid disk 1
[ 177.373063] md/raid:md0: allocated 8606kB
[ 177.373150] md/raid:md0: not enough operational devices (2/8 failed)
[ 177.373214] RAID conf printout:
[ 177.373219] --- level:5 rd:8 wd:6
[ 177.373221] disk 0, o:1, dev:sda
[ 177.373223] disk 1, o:1, dev:sdc
[ 177.373225] disk 2, o:1, dev:sdg
[ 177.373226] disk 3, o:1, dev:sdd
[ 177.373228] disk 4, o:1, dev:sdf
[ 177.373230] disk 6, o:1, dev:sdb
[ 177.373753] md/raid:md0: failed to run raid set.
[ 177.373784] md: pers->run() failed ...
[ 177.373840] md: md0 stopped.
[ 177.373845] md: unbind<sda>
[ 177.383129] md: export_rdev(sda)
[ 177.383145] md: unbind<sdb>
[ 177.407022] md: export_rdev(sdb)
[ 177.407039] md: unbind<sdf>
[ 177.419010] md: export_rdev(sdf)
[ 177.419029] md: unbind<sdd>
[ 177.431006] md: export_rdev(sdd)
[ 177.431023] md: unbind<sdg>
[ 177.443006] md: export_rdev(sdg)
[ 177.443022] md: unbind<sdc>
[ 177.455007] md: export_rdev(sdc)
So, I backed up the superblocks from each drive...
root@coruscant:~# mdadm --misc --dump=drives /dev/sd[abcdfgh]
/dev/sda saved as drives/sda.
/dev/sda also saved as drives/wwn-0x50014ee60339f23a.
/dev/sda also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2847602.
/dev/sdb saved as drives/sdb.
/dev/sdb also saved as drives/wwn-0x50014ee6adc4ee09.
/dev/sdb also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1998491.
/dev/sdc saved as drives/sdc.
/dev/sdc also saved as drives/wwn-0x50014ee60326b419.
/dev/sdc also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2767772.
/dev/sdd saved as drives/sdd.
/dev/sdd also saved as drives/wwn-0x50014ee6add8673f.
/dev/sdd also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T2617790.
/dev/sdf saved as drives/sdf.
/dev/sdf also saved as drives/wwn-0x50014ee60339f3a4.
/dev/sdf also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T3358756.
/dev/sdg saved as drives/sdg.
/dev/sdg also saved as drives/wwn-0x50014ee003763c8d.
/dev/sdg also saved as drives/ata-WDC_WD30EFRX-68AX9N0_WD-WMC1T1095358.
/dev/sdh saved as drives/sdh.
/dev/sdh also saved as drives/wwn-0x50014ee20db25404.
/dev/sdh also saved as drives/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4HYL0J2.
...and I took the only option I felt I had left: Re-creating the RAID5 array entirely.
In hindsight, I'm not sure I needed to do this. I'm pretty sure I could have edited the utime and events fields in the dumped superblock above to match the other drives (according to how they're laid out on this page), used mdadm --misc --restore to put it back, attempted --assemble, edited the superblock again with the correct sb_csum, restored again, and then assembled successfully. But I figured I'd try the supported destructive solution before the unsupported one...
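If you're curious what that would have involved: the files written by --dump are sparse copies of each drive that contain only the metadata, so the raw superblock fields can be inspected directly. Here's a rough sketch of the inspection step, assuming the v1.2 superblock sits at the 4096-byte Super Offset shown in --examine and keeps its 64-bit little-endian events counter at offset 200 within the struct (verify those offsets against the layout documentation before editing anything):
# Peek at the stale drive's events counter inside the dumped superblock
# (offsets here are assumptions, per the note above)
xxd -s $((4096 + 200)) -l 8 drives/sdh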
It's extremely important for this part that the array is re-created with the exact same settings as the original. The output from the failed --assemble above told me the correct order of the drives, and the output from --examine above told me the correct chunk size and everything else. So, I specified all of the correct parameters in the correct order to mdadm --create (making sure to mark the drive I'd removed entirely as missing in the slot for the old /dev/sdh):
NOTE: If you are following this because you're trying to resurrect your own dead RAID array, this is not meant to be an all-inclusive guide. I am not responsible for what the following will do to your data if you screw this up. You have been warned (and should probably be posting on a forum or asking someone in IRC at this point).
root@coruscant:~# mdadm -v --create /dev/md0 --chunk=512 --level=5 --raid-devices=8 /dev/sda /dev/sdc /dev/sdg /dev/sdd /dev/sdf /dev/sdh /dev/sdb missing --assume-clean
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: /dev/sda appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdc appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdg appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdd appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdf appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdh appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: /dev/sdb appears to be part of a raid array:
level=raid5 devices=8 ctime=Sun Apr 14 18:27:58 2013
mdadm: size set to 2930135040K
mdadm: automatically enabling write-intent bitmap on large array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
root@coruscant:~# fsck /dev/md0
fsck from util-linux 2.27.1
e2fsck 1.42.13 (17-May-2015)
/dev/md0: clean, 266393/320485376 files, 1231997624/5127736320 blocks
root@coruscant:~# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Jan 6 21:05:42 2017
Raid Level : raid5
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
Raid Devices : 8
Total Devices : 7
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Jan 6 21:06:43 2017
State : clean, degraded
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : coruscant:0 (local to host coruscant)
UUID : 7b342406:4f8c41c2:d66d2452:03f3c67c
Events : 8
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 32 1 active sync /dev/sdc
2 8 96 2 active sync /dev/sdg
3 8 48 3 active sync /dev/sdd
4 8 80 4 active sync /dev/sdf
5 8 112 5 active sync /dev/sdh
6 8 16 6 active sync /dev/sdb
14 0 0 14 removed
root@coruscant:~# mount /dev/md0 /store
Success! It worked! Still not sure why using --force with --assemble didn't. I also have no idea why mdadm thinks there's a removed drive in slot 14 rather than 7. Whatever. All my data appears to be here and the array is clean (even if it's still degraded).
The next thing I did was pop in a new drive to replace the other failing one I'd removed. This drive became the new /dev/sdi. I added it to the array and it began rebuilding automatically:
root@coruscant:~# mdadm --manage /dev/md0 --add /dev/sdi
mdadm: added /dev/sdi
root@coruscant:~# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Jan 6 21:05:42 2017
Raid Level : raid5
Array Size : 20510945280 (19560.76 GiB 21003.21 GB)
Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Jan 6 21:17:16 2017
State : clean, degraded, recovering
Active Devices : 7
Working Devices : 8
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Rebuild Status : 0% complete
Name : coruscant:0 (local to host coruscant)
UUID : 7b342406:4f8c41c2:d66d2452:03f3c67c
Events : 11
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 32 1 active sync /dev/sdc
2 8 96 2 active sync /dev/sdg
3 8 48 3 active sync /dev/sdd
4 8 80 4 active sync /dev/sdf
5 8 112 5 active sync /dev/sdh
6 8 16 6 active sync /dev/sdb
8 8 128 7 spare rebuilding /dev/sdi
root@coruscant:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdi[8] sdb[6] sdh[5] sdf[4] sdd[3] sdg[2] sdc[1] sda[0]
20510945280 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/7] [UUUUUUU_]
[>....................] recovery = 0.0% (1911964/2930135040) finish=331.8min speed=147074K/sec
bitmap: 8/22 pages [32KB], 65536KB chunk
unused devices: <none>
I ran watch cat /proc/mdstat to track its progress. When it was done, I ran mdadm --detail --scan and used the output to replace the old line in /etc/mdadm/mdadm.conf. This was necessary since I now have an entirely new RAID array (you can see the creation time and UUID are different in the output above). I also uncommented the entry in /etc/fstab (according to blkid, the UUID for the filesystem hadn't changed) and did a reboot just to be sure everything would come back up correctly.
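For reference, the config refresh amounts to something like this (a sketch, not a transcript from my session; double-check the generated ARRAY line before trusting it):
# Remove the stale ARRAY line for the old UUID from /etc/mdadm/mdadm.conf first,
# then append the definition for the newly-created array:
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
On a Debian-based system it's also worth regenerating the initramfs afterwards (update-initramfs -u) so the copy of mdadm.conf embedded there matches the one on disk.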
In the end, this story had a happy ending. I lost access to all my files for about a week and had to spend money I was supposed to be using for the upcoming Nintendo Switch, but I now have the experience of having recovered a RAID array! I've also now got both mdadmd and smartd configured properly with e-mail notifications, so hopefully I won't have to deal with this again.
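For anyone who wants the same safety net, the relevant configuration is roughly the following (illustrative snippets only; the address and test schedule are placeholders, not my actual settings):
# /etc/mdadm/mdadm.conf -- mdadm's monitor mode mails this address on events like Fail and DegradedArray
MAILADDR admin@example.com
# /etc/smartd.conf -- monitor everything, enable offline data collection, schedule a nightly
# short self-test at 02:00, and send mail when anything looks unhealthy
DEVICESCAN -a -o on -S on -s (S/../.././02) -m admin@example.com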