Dell PowerEdge 2850 RAID Failure

About three weeks ago one of my newest servers had a major failure. The server runs a critical business web application, so uptime is very important. For this reason we had configured the server for reliability:

Dell PowerEdge 2850
Dual Xeon 2.8GHz CPU
2GB ECC DDR-SDRAM
RAID 5 LSI RAID Controller
6 x 73GB Maxtor SCSI Disks

Below is a breakdown of the complete saga with Dell.

On 4/27/2006 around 10AM we attempted to log into our client's production web server, which hosts their critical business application. Our logins were successful, but we were immediately bumped back to the login screen. We checked the server's drive state and noticed it had rejected a hard drive from the array. To resolve the problem we attempted to restart the server, but it was unable to fully boot into Windows. At this point the server was completely unresponsive and would not bring up a login screen.

After some research we found that the RAID controller was corrupting the data being written to the disks. This had been going on for some time, and appears to have corrupted winlogon.exe.

Within a short amount of time we were able to bring their corporate website back online on the backup server. However, their business application took significantly longer to bring back up because of its frequently changing data. To retrieve the live data from the server we booted using a restore tool and copied the files and database onto a spare hard drive. This was to ensure we had the most recent version of the data, including any transactions from that morning. We were able to bring everything back online on the backup server by 3:30PM.

We immediately called Dell, who recommended we upgrade the firmware on the server's RAID controller. Dell pointed out that this specific machine had shipped with a firmware version that had known problems. First thing in the morning on 4/28/2006 we upgraded the firmware to the recommended version. We also upgraded the motherboard BIOS firmware, as recommended by Dell. We also had to completely reinstall Windows 2003 Server and MS SQL Server 2000 to bring the server back online. We then scheduled a turn-up for Tuesday, May 2nd, to give the server time to burn in and to confirm the firmware had fixed the problem.

On 05/02/2006 at 8PM we attempted to bring all the data back over to the production machine. We immediately checked the server's drive state and noticed it had rejected another hard drive from the array. This was a sign that our firmware update had not fixed the problem.

The next morning I called Dell back, and they shipped overnight a new RAID “Key” chip, controller card memory, and a new backplane for the drives to mount into. We replaced all of this equipment and let the server “burn-in” to ensure this fix would work. We scheduled another launch date for 05/09/2006 after letting the server run over the weekend. At 7PM we met at the office and moved the site files and database back over. At approximately 9PM, after a final reboot, we noticed the server had rejected another hard drive. To be safe we immediately moved the site back to the backup server.

On 05/16/2006, after heavy lobbying, Dell shipped a new server, which seems to have resolved the problem. On further inspection I noticed they had changed SCSI hard disk vendors. My theory is that there is some incompatibility between the Maxtor disks and the LSI RAID controller, but at this point I cannot prove anything. The replacement Seagate disks seem to have resolved the problem.

This is intended as a heads-up for anyone dealing with the same issue. Level 1 Dell server support seems to have failed us here; however, once the problem was escalated they acted quickly to ship a new server.

2 Comments »

  1. Comment by steve

    Thanks! I’m having the same issue. This helped me quite a bit.

  2. Comment by Nadeem

    arrrggggg

    Had the same issue and wish I had seen your post before, as it took a long time on the phone with Dell to convince them to send a new server out ASAP.

    If they know of the Maxtor conflict then they should waste less of their customers' time. Seagate works a dream, and I agree with your theory.

    Thank you
