
Author Topic: Recovering from RAID1 drive failure  (Read 16723 times)

m2k3423

  • Level 2 Member
  • **
  • Posts: 29
Recovering from RAID1 drive failure
« on: March 13, 2010, 04:12:28 PM »

Hello,

This is to share an ordeal in trying to get the RAID1 back without risking the loss of all data on the surviving drive.

One of the drives in a RAID1-configured DNS-323 failed. The failed drive was a Seagate 1TB with firmware SD15, which is known to die 'expectedly'. The DNS-323 was originally on firmware 1.06. I replaced the failed drive with a Western Digital Caviar Green WD10EARS 1TB (32MB cache), and the configuration wizard guided me to format the new drive; however, it got stuck at the 94% completion mark (I found out later that this is widely reported in the forum). After leaving it in that state for more than two hours with the progress still at 94%, I had to force a power cycle because the web configuration was no longer responsive.

After the unit booted up and I got back into the web configuration, the wizard once again asked for the drives to be formatted to become part of the RAID1 array. I clicked "Skip" (this step is crucial: by proceeding, the whole RAID1 will likely be damaged). Then, under "Tools" > "RAID1", I manually started "Re-build", which re-synced the drives. After many hours, the sync was reported as complete. However, upon rebooting the DNS-323, the web configuration wizard again reported that the new drive needed to be formatted; again, I clicked "Skip". The "Status" page, meanwhile, reported that the RAID1 array was in sync.

I googled for similar reports and decided to flash firmware 1.08. However, the behavior was the same, even though the firmware release notes did mention that the "stuck at 94%" problem had been resolved. I think this is because the problem had already been triggered in my case; had I flashed 1.08 before replacing the drive, it might have been avoided.

This bug is what the rest of this post is about. The workaround requires familiarity with Linux and with all the steps necessary to get funplug onto the DNS-323, so do not proceed if you are not comfortable with those.

First, let me zero in on the problem. Although the internal Linux RAID sub-system was perfectly happy with the health of the RAID1 array, the web configuration wizard depends on a proprietary DNS-323 data file to track the state of the array, while the "Status" page queries the Linux /proc/mdstat to establish its health. These two different ways of establishing the state of the RAID array are what produce the contradictory information.
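The Linux-side view can be confirmed directly from a shell on the box. A quick check, assuming the array is the usual /dev/md0 on the DNS-323:

    cat /proc/mdstat                    # a healthy mirror lists both members; [UU] means both halves are active
    /usr/sbin/mdadm --detail /dev/md0   # shows "State : clean" (or active) with both partitions listed as "active sync"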

The internal RAID state tracking data file appears to be hd_magic_num, kept in /dev/mtdblock0 and /dev/mtdblock1 (both are minix filesystems). The file is also copied out to /mnt/HD_xx/.systemfile/ (typically HD_a2, HD_a4 and HD_b4). The format appears to be two random tokens followed by the serial numbers of the right and left drives, in that order. The random tokens are probably meant for consistency verification across all the scattered copies; if they do not match, that probably allows the same two drives to be reused for building a RAID1 from scratch. In the degraded-RAID case with a replacement drive, because of the "stuck at 94%" problem, these files are never updated with the correct information.
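For illustration only, a hd_magic_num laid out that way would look something like the following (all four values are made-up placeholders; lines 1-2 are the tokens, line 3 the right-bay drive's serial, line 4 the left-bay drive's serial):

    123456789
    987654321
    SERIAL-OF-RIGHT-DRIVE
    SERIAL-OF-LEFT-DRIVE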

Therefore, funplug is required to get into the box and manually fix these files. Assuming you get into the box via telnet or ssh (dropbear), the following needs to be done (a consolidated shell sketch follows the list):

1. Mount /dev/mtdblock0:  mount -t minix /dev/mtdblock0 /sys/mtd1
2. Go to /sys/mtd1: cd /sys/mtd1
3. Make a backup copy of hd_magic_num: cp hd_magic_num hd_magic_num.old
4. Edit hd_magic_num: vi hd_magic_num
5. Change the first two numbers to any numbers of your choice (32-bit integers in decimal).
6. Change the 3rd line to the serial number of the right drive.
7. Change the 4th line to the serial number of the left drive.
8. Exit vi.
9. Carefully check that all the information is correct.
10. Un-mount /dev/mtdblock0: umount /sys/mtd1 (this step is crucial).
  
11. Mount /dev/mtdblock1:  mount -t minix /dev/mtdblock1 /sys/mtd2
12. Go to /sys/mtd2: cd /sys/mtd2
13. Make a backup copy of hd_magic_num: cp hd_magic_num hd_magic_num.old
14. Edit hd_magic_num: vi hd_magic_num
15. Change the first two numbers to any numbers of your choice (32-bit integers in decimal).
16. Change the 3rd line to the serial number of the right drive.
17. Change the 4th line to the serial number of the left drive.
18. Exit vi.
19. Carefully check that all the information is correct.
20. Copy hd_magic_num to the copies on the hard drives:
  • cp hd_magic_num /mnt/HD_a2/.systemfile
  • cp hd_magic_num /mnt/HD_a4/.systemfile
  • cp hd_magic_num /mnt/HD_b4/.systemfile
21. Un-mount /dev/mtdblock1: umount /sys/mtd2 (this step is crucial).
22. Re-start the unit and verify that the web configuration wizard does not ask to format the drive again.
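For those comfortable doing it in one go, here is a minimal shell sketch of steps 1-22, assuming a funplug shell (telnet or dropbear). The token values and serial numbers are placeholders you must replace, and the same values are written to every copy, since the copies appear to be cross-checked against each other:

    # Placeholders - substitute your own values before running
    TOKEN1=123456789                      # any 32-bit decimal integer
    TOKEN2=987654321                      # any 32-bit decimal integer
    RIGHT_SERIAL=SERIAL-OF-RIGHT-DRIVE
    LEFT_SERIAL=SERIAL-OF-LEFT-DRIVE

    write_magic() {
        # back up the existing file, then write tokens + serials in the order described above
        cp "$1/hd_magic_num" "$1/hd_magic_num.old" 2>/dev/null
        printf '%s\n%s\n%s\n%s\n' "$TOKEN1" "$TOKEN2" "$RIGHT_SERIAL" "$LEFT_SERIAL" > "$1/hd_magic_num"
        cat "$1/hd_magic_num"             # eyeball the result before moving on
    }

    mount -t minix /dev/mtdblock0 /sys/mtd1 && write_magic /sys/mtd1 && umount /sys/mtd1
    mount -t minix /dev/mtdblock1 /sys/mtd2 && write_magic /sys/mtd2 && umount /sys/mtd2

    # propagate the same file to whichever on-disk copies exist on your unit
    for d in /mnt/HD_a2 /mnt/HD_a4 /mnt/HD_b4; do
        [ -d "$d/.systemfile" ] && write_magic "$d/.systemfile"
    done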

Alternative way to get the RAID1 array rebuilt

I experimented with the following approach and found it equally workable, but it too requires familiarity with funplug and Linux (a sketch follows the list):

1. Get into the box via telnet or ssh.
2. Manually partition the replacement drive to match the surviving drive using fdisk.
3. Manually rebuild the RAID1: /usr/sbin/mdadm --manage --add /dev/md0 /dev/sdx2
4. Check /proc/mdstat for re-sync status.
5. When re-sync is done, update hd_magic_num as described above.
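For reference, a rough sketch of those steps, assuming the surviving drive is /dev/sda and the replacement is /dev/sdb (verify with fdisk first, since cloning a partition table onto the wrong disk destroys its data), and assuming sfdisk is available; otherwise recreate the layout by hand with fdisk as in step 2:

    fdisk -l                                    # confirm which disk is the survivor and which is the replacement
    sfdisk -d /dev/sda | sfdisk /dev/sdb        # clone the surviving drive's partition table to the new drive
    /usr/sbin/mdadm --manage --add /dev/md0 /dev/sdb2
    cat /proc/mdstat                            # re-check periodically until the resync completes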

This saga once again reaffirms my trust in the DNS-323: because it uses Linux, there are many ways to work around a user-interface failure.

One more thing: if you have a Linux box with a spare SATA slot, it may be worthwhile to pull out the surviving drive and slot it into that box to give the drive a health check. For example, on Fedora (9 and above recommended), use "Disk Utility" to read the SMART status. What you should look at are the "Reallocated Sector Count" and "Current Pending Sector Count" attributes; any reading other than 0 means the drive will likely fail soon. If you have a 1TB Seagate Barracuda, you can also check the firmware release this way and, if it is the affected SD15 version, download the bootable ISO image to update the firmware to SD1A.
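If you prefer the command line and the Linux box has smartmontools installed, the same two attributes can be read directly. A sketch, assuming the pulled drive shows up as /dev/sdb:

    smartctl -A /dev/sdb | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
    # A non-zero RAW_VALUE for either attribute is the warning sign described above.
    smartctl -i /dev/sdb    # also reports the firmware revision, e.g. whether a Seagate is still on SD15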

Good luck, and I hope this post will never be of any use to you. :-)




« Last Edit: March 13, 2010, 06:14:16 PM by m2k3423 »
Logged

gunrunnerjohn

  • Level 11 Member
  • *
  • Posts: 2717
Re: Recovering from RAID1 drive failure
« Reply #1 on: March 14, 2010, 06:44:50 AM »

Of course, knowing that RAID of any level is not backup would be a good thing to remember as well.  RAID-1 is strictly for maximum availability of the data; it's not intended to be a backup.
Logged
Microsoft MVP - Windows Desktop Experience
Remember: Data you don't have two copies of is data you don't care about!
PS: RAID of any level is NOT a second copy.

m2k3423

  • Level 2 Member
  • **
  • Posts: 29
Re: Recovering from RAID1 drive failure
« Reply #2 on: March 15, 2010, 07:41:10 PM »

Agreed - the key issue with any RAID-configured NAS is "AVAILABILITY".

Therefore, I hope the D-Link folks are listening: when a RAID drive goes into a degraded state, hitting the "stuck at 94%" problem while recovering from that degraded state means "no availability"!

The thing that makes this problem irritating is that the underlying Linux has happily rebuilt the RAID; it is the web user interface that is ignorant of the real state of the RAID array and insists on re-formatting the RAID drives.  Years ago, when drives were a few hundred megabytes, this might not have been a big issue, but in this terabyte era, a full format and re-sync is the very last thing one wants a RAID array to fall into. It hurts availability really badly!

So, to me, whether RAID is for backup or not is too rhetorical; we are not in the communist era, no need for slogans. A RAID array should keep data safe, keep it available as much of the time as possible, create as few problems as possible, and, when it falls over, get back up reliably and quickly.  All D-Link needs to do is fix the damn web user-interface bug and get on with life.
Logged

fordem

  • Level 10 Member
  • *****
  • Posts: 2168
Re: Recovering from RAID1 drive failure
« Reply #3 on: March 16, 2010, 05:17:41 AM »

So, to me, whether RAID is for backup or not is too rhetorical; we are not in the communist era, no need for slogans

Forget communism and slogans for a while - it's nothing so sinister.

Many of the less experienced users do not understand the technology or its intended usage - search these and other NAS-related forums and you will find many crying out in frustration after losing precious data that they thought was safe because it was stored in a RAID array - I created my signature one day when I got tired of explaining how RAID should be used.

Consider it a "public service announcement" - I'm sure gunrunnerjohn has a similar story.
Logged
RAID1 is for disk redundancy - NOT data backup - don't confuse the two.

gunrunnerjohn

  • Level 11 Member
  • *
  • Posts: 2717
Re: Recovering from RAID1 drive failure
« Reply #4 on: March 16, 2010, 05:21:40 AM »

If the RAID array has really been rebuilt, does rebooting the NAS bring it back to its senses?
Logged
Microsoft MVP - Windows Desktop Experience
Remember: Data you don't have two copies of is data you don't care about!
PS: RAID of any level is NOT a second copy.

m2k3423

  • Level 2 Member
  • **
  • Posts: 29
Re: Recovering from RAID1 drive failure
« Reply #5 on: March 17, 2010, 10:14:23 PM »

Nope, I tried rebooting many times before resorting to the manual editing.

The key problem is that the web configuration server scripts/procedures executed by 'goweb' determine the RAID array status in a different way than the "Status" page does.  IMHO, the most reliable way to determine the state of the RAID array is to rely on 'mdadm' or /proc/mdstat.

hd_magic_num does serve a purpose: detecting changes in the drives inserted in the slots. IMHO, a change should trigger a warning page telling the user that the drive configuration has changed, displaying the last known drives that constitute the RAID array and the new drive or drives detected in the slots, along with the RAID UUIDs of the previously known drives and the newly detected ones.  The user should then be given the choice to: 1) Skip (assuming all is well); 2) Select which new drive replaces which old drive (thereby initializing the new drive, i.e. erasing any previous RAID signature, re-partitioning, and re-syncing if auto-re-sync is enabled); or 3) Forget about the previous RAID array and rebuild the array from scratch (this allows putting in new RAID drive pairs, or reverting the drives to JBOD or individual volumes).
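For what it's worth, the RAID UUID such a page could display is already stored in each member's md superblock and is easy to read. A sketch, assuming the array members are /dev/sda2 and /dev/sdb2:

    /usr/sbin/mdadm --examine /dev/sda2 | grep UUID
    /usr/sbin/mdadm --examine /dev/sdb2 | grep UUID
    # matching UUIDs mean the two partitions belong to the same array; a brand-new drive has no superblock at all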

One more point: since 1.08 now supports ext3, there should be an option under Tools to convert an existing ext2 volume to ext3 (tune2fs -j /dev/sdx2), after warning about the need to back up data and avoid power interruptions.  IMHO, with drives a terabyte in size, the risk of a corrupted ext2 filesystem from a power glitch is far too much to stomach.  Again, without a web configuration option, I had to manually edit files to get the RAID to mount as ext3, after pulling the RAID drives out to a Linux machine to perform the conversion.
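For anyone attempting the same conversion on an external Linux machine, a rough sketch (assuming the data partition shows up there as /dev/sdb2 - adjust to your system, and back up first as noted):

    umount /dev/sdb2                     # safest to convert while the filesystem is not mounted
    tune2fs -j /dev/sdb2                 # add an ext3 journal to the existing ext2 filesystem
    e2fsck -f /dev/sdb2                  # verify the filesystem afterwards
    mkdir -p /mnt/check && mount -t ext3 /dev/sdb2 /mnt/check   # confirm it now mounts as ext3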

RAID is just like buying an insurance policy: it gives peace of mind that when something goes bad, there is another avenue to fall back on to make a bad day a little better.  Being insured does not mean one can be reckless in taking risks, so having data in a RAID array does not mean there is no need to back up. Also, an insurance policy had better work when you most need it; in this light, when a drive goes down, recovering the RAID array with a new drive had better work easily, or it drives people nuts.

Please do not get me wrong, I have absolutely no intention to offend.  I was just frustrated that a simple web configuration bug is putting black marks on an otherwise great product.
Logged

gunrunnerjohn

  • Level 11 Member
  • *
  • Posts: 2717
Re: Recovering from RAID1 drive failure
« Reply #6 on: March 18, 2010, 05:38:39 AM »

RAID is just like buying an insurance policy: it gives peace of mind that when something goes bad, there is another avenue to fall back on to make a bad day a little better.  Being insured does not mean one can be reckless in taking risks, so having data in a RAID array does not mean there is no need to back up. Also, an insurance policy had better work when you most need it; in this light, when a drive goes down, recovering the RAID array with a new drive had better work easily, or it drives people nuts.
A number of us keep trying to bang home the concept that RAID of any level is not backup, but the message falls on deaf ears many times.  It's mind numbing to see how many people have one copy of their data and are surprised when it's gone from a hardware failure or even a software glitch.
Logged
Microsoft MVP - Windows Desktop Experience
Remember: Data you don't have two copies of is data you don't care about!
PS: RAID of any level is NOT a second copy.

fordem

  • Level 10 Member
  • *****
  • Posts: 2168
Re: Recovering from RAID1 drive failure
« Reply #7 on: March 18, 2010, 07:26:45 AM »

A number of us keep trying to bang home the concept that RAID of any level is not backup, but the message falls on deaf ears many times.  It's mind numbing to see how many people have one copy of their data and are surprised when it's gone from a hardware failure or even a software glitch.

Let's forget about hardware failure & software glitches - and assume perfectly functional hardware - what about good old operator error?
Logged
RAID1 is for disk redundancy - NOT data backup - don't confuse the two.

gunrunnerjohn

  • Level 11 Member
  • *
  • Posts: 2717
Re: Recovering from RAID1 drive failure
« Reply #8 on: March 18, 2010, 07:32:10 AM »

Operator glitch?  :o  Very true, I should have been more complete in my description. :D
Logged
Microsoft MVP - Windows Desktop Experience
Remember: Data you don't have two copies of is data you don't care about!
PS: RAID of any level is NOT a second copy.

mesostinky

  • Level 1 Member
  • *
  • Posts: 4
Re: Recovering from RAID1 drive failure
« Reply #9 on: March 30, 2010, 04:31:26 PM »

So if a drive fails while you're using RAID 1 on the DNS-323, you're not just going to be able to pop in a new drive and tell the 323 to format it and then rebuild the mirror?   ???

Is this post accurate about the 323 having a bug in the web utility that prevents everyone from properly rebuilding the mirror after a drive fails?  Is this dependent on firmware?

I'm very curious because I just bought two 1TB drives and need to know the correct procedure if one of them fails in my DNS-323. And yes, my DNS-323 is not my backup source. It would just be easier to pop the drive in and hit rebuild vs. wiping everything and starting from scratch. Thanks for any info.
« Last Edit: March 30, 2010, 04:35:20 PM by mesostinky »
Logged

fordem

  • Level 10 Member
  • *****
  • Posts: 2168
Re: Recovering from RAID1 drive failure
« Reply #10 on: March 30, 2010, 08:46:48 PM »

So if a drive fails while you're using RAID 1 on the DNS-323, you're not just going to be able to pop in a new drive and tell the 323 to format it and then rebuild the mirror?   ???

Is this post accurate about the 323 having a bug in the web utility that prevents everyone from properly rebuilding the mirror after a drive fails?  Is this dependent on firmware?

I'm very curious because I just bought two 1TB drives and need to know the correct procedure if one of them fails in my DNS-323. And yes, my DNS-323 is not my backup source. It would just be easier to pop the drive in and hit rebuild vs. wiping everything and starting from scratch. Thanks for any info.

The correct procedure is to power down the DNS-323, remove the failed drive and replace it with a new one, power the DNS-323 on, log in to the web admin page and follow the prompts - the unit should format the new disk and then sync the pair.

I would also recommend a backup before you start this procedure, if you do not already have one.

In three years I have never had a disk fail, but I have simulated failures and it has successfully rebuilt the array on every occasion.
Logged
RAID1 is for disk redundancy - NOT data backup - don't confuse the two.

m2k3423

  • Level 2 Member
  • **
  • Posts: 29
Re: Recovering from RAID1 drive failure
« Reply #11 on: March 30, 2010, 10:16:11 PM »

For a 1TB RAID array, if your firmware is pre-1.08, you may run into the "94% problem" - that is, the web configuration will stall at 94% while the rebuild is happening.  In 1.08, according to the release notes, this problem is supposedly solved; however, I have not tried it.

For smaller drives, say 650GB, I had previously recovered from a degraded array using the web configuration menu; I believe the firmware was 1.05 at that time.

Recently, after one of the 1TB drives failed, I had some problems getting the degraded RAID1 back up because the firmware was initially 1.06. This was the incident described in this thread. Even after upgrading the firmware to 1.08, it still did not get the RAID1 up; I had to manually mess with some internal configuration files and managed to recover the RAID1 array to full health.

After the above DNS-323 saga, I upgraded two other DNS-323 RAID1 arrays (one from 650GB to 1TB, one from 1TB to 1.5TB). In these two cases, I did not use the web configuration at all, but resorted to pulling the drives out to a Linux machine to get the job done. After that, I reinserted the drives and the DNS-323 happily accepted them.

Some people were lucky and got the degraded array back up using the web configuration menu. A good number of others (google "DNS-323 94%"), myself included, were not that lucky - although that was pre-1.08.  So it is a long answer to your question, but if you can afford to lose all the data in the RAID1 array at any time, it really does not matter that much.

For me, avoiding rsync'ing GB of data around is THE goal.
« Last Edit: March 30, 2010, 10:18:07 PM by m2k3423 »
Logged

mesostinky

  • Level 1 Member
  • *
  • Posts: 4
Re: Recovering from RAID1 drive failure
« Reply #12 on: March 31, 2010, 12:50:46 PM »

Thanks guys, great answers. I'm on 1.06 with a single 500GB drive but will update that to 1.08  first, do a factory reset, install the new drives, then restore the data.
Logged

Wiggs

  • Level 3 Member
  • ***
  • Posts: 137
    • dwithers.com
Re: Recovering from RAID1 drive failure
« Reply #13 on: April 01, 2010, 03:55:18 AM »

As already stated, the correct procedure to recover from a RAID drive failure is to BACK UP first and then go through the process of replacing and rebuilding the array.  I have also had to rebuild the RAID array, and the DNS has done it perfectly; however, I had a backup of the data from the good drive just in case.  If you do not have a backup prior to attempting to rebuild the array, there is always a possibility that good ol' Murphy's Law will apply!

Regards,

Wiggs
Logged
Wiggs,

DNS-323, 2-500GB Seagate Drives, FW 1.08
D-Link DGS-1005G Gigabit Switch
Asus O!Play Air Media Player
WinXP PC
OpenSuse 11.2 PC
Macbook 5,2 - Snow Leopard

m2k3423

  • Level 2 Member
  • **
  • Posts: 29
Re: Recovering from RAID1 drive failure
« Reply #14 on: April 05, 2010, 06:14:35 AM »

Hi, I just helped a friend recover a DNS-323 from a degraded RAID1 and bumped into the "94% formatting stall" problem again.  In the process, I found a simpler method, but it requires telnetting into the DNS-323 (a consolidated sketch follows the list):
  • Assuming you have hit the "94% formatting stall" problem and have waited long enough, telnet into the box.
  • Check that the partition tables of the two drives are similar (using fdisk).
  • Add the new drive back into the RAID1 array: mdadm --manage --add /dev/md0 /dev/sd?2 (? being either a or b, depending on which drive is the replacement drive).
  • Wait for the RAID1 to finish re-syncing. This may take hours, depending on your capacity.
  • Check /proc/mdstat to make sure the RAID1 array is in sync. Then issue this command: /usr/sbin/hd_verify -w. This forces an update of hd_magic_num.
  • Power down from the front panel or via the web configuration menu.
  • Power up, log on to the web configuration menu, and verify that it no longer asks you to format your drive or rebuild your RAID1.
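
The same steps as a single session, assuming the replacement drive shows up as /dev/sdb (use /dev/sda2 instead if the replacement is the other drive - confirm with fdisk first):

    fdisk -l                                           # confirm which drive is the replacement
    /usr/sbin/mdadm --manage --add /dev/md0 /dev/sdb2  # re-add the replacement to the mirror
    cat /proc/mdstat                                   # re-check until the resync reaches 100% and shows [UU]
    /usr/sbin/hd_verify -w                             # as noted above, forces hd_magic_num to be updated
    # then power down from the front panel and power back up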
Logged