
Dell PowerEdge 2850 RAID Failure

About three weeks ago one of my newest servers had a major failure. The server runs a critical business web application, so uptime is very important. For this reason we had configured it for reliability:

Dell PowerEdge 2850
Dual Xeon 2.8GHz CPU
2GB ECC DDR-SDRAM
RAID 5 LSI RAID Controller
6 x 73GB Maxtor SCSI Disks

Below is a breakdown of the complete saga with Dell.

On 4/27/2006 around 10AM we attempted to log into our client's production web server, which hosts their critical business application. Our logins were successful, but we were immediately bumped back to the login screen. We checked the server's drive state and noticed it had rejected a hard drive from the array. To resolve the problem we attempted to restart the server, but it was unable to fully boot into Windows. At this point the server was completely unresponsive and would not bring up a login screen.

After some research we found the RAID controller was corrupting the data being written to the disks. This had been going on for some time, and appears to have corrupted winlogon.exe.

Within a short amount of time we were able to bring their corporate website back online on the backup server. However, their business application took significantly longer to bring back up because of the frequently changing data. To retrieve the live data off the server we booted using a restore tool and copied the files and database onto a spare hard drive. This was to ensure we had the most recent version of the data, including any transactions from that morning. We were able to bring everything back online on the backup server by 3:30PM.

We immediately called Dell, who recommended we upgrade the firmware on the server's RAID controller. Dell pointed out that this specific machine had shipped with a firmware version which had known problems. First thing in the morning on 4/28/2006 we upgraded the firmware to the recommended version. We also upgraded the motherboard BIOS firmware as recommended by Dell. We also had to completely reinstall Windows Server 2003 and MS SQL Server 2000 to bring the server back online. After letting the server run we scheduled a turn-up for Tuesday, May 2nd. This was intended to give the server time to burn in and to confirm the firmware had fixed the problem.

On 05/02/2006 at 8PM we attempted to bring all the data back over to the production machine. We immediately checked the server's drive state and noticed it had rejected another hard drive from the array. This was a sign that the problem had not been fixed by our firmware update.

The next morning I called Dell back and they shipped, overnight, a new RAID “Key” chip, controller card memory, and a new backplane for the drives to mount into. We replaced all of this equipment and let the server “burn in” to ensure this fix would work. After letting the server run over the weekend, we scheduled another launch date for 05/09/2006. At 7PM we met at the office and moved the site files and database back over. At approximately 9PM, after a final reboot, we noticed the server had rejected another hard drive. To be safe we immediately moved the site back to the backup server.

On 05/16/2006, after heavy lobbying, Dell shipped a new server, which seems to have resolved the problem. After further inspection I noticed they had changed SCSI hard disk vendors. It’s my theory that there is something wrong between the Maxtor disks and the LSI RAID controller, but at this point I cannot prove anything. The replacement Seagate disks seem to have resolved the problem.

This is intended to be a heads-up for anyone dealing with the same issue. Level 1 Dell server support seems to have failed us here; however, once the problem was escalated they took action quickly to ship a new server.


Getting Married

I have started a new website called Wright Family to document the upcoming wedding. Sorry for not posting as much information lately. If you’d like to follow the progress, visit our photo journal website.


A Whole New World

I am in love with a Geek. It’s true… a big brain and lightning-fast typing speed are a turn-on. With such a love, however, comes a wide range of new experiences. I would like to take you on a magic carpet ride to the world of “geek,” told from an outsider’s view. It has been quite an experience trying to learn all the ins and outs of the IT world without seeming completely stupid.

The first thing that I had to learn about was Slashdot. Yes, the wonderful world of Slashdot… the last viewed page in every geek’s web browser. At first, I thought that this portal to the IT community was simply a page of enjoyment, a place to go when there is nothing better to do at work. I, however, was wrong. It is my firm belief that Slashdot has hypnotic powers that are beyond my comprehension. Five, six, seven times a day I will see my geek viewing this site. Not much has changed since the last view, maybe a new post or two. Every time, though, he goes to the site with anticipation… hoping… praying for new content. To me, this is absolutely silly. I have learned that I sometimes take the backseat to a good Slashdot post. Geeks need their Slashdot, and I have just learned to respect that.

The next thing I learned was that every room of the house needs to have some computer equipment in it. While it is helpful at times to be able to check my email at any point, sometimes such equipment can be excessive. Take for example the VERY large, cumbersome antenna in the upstairs “lab.” This antenna was in fact pointed at the neighbor’s house for weeks before I realized that the neighbors probably assumed my geek was tapping into their personal lives. The antenna was bought because my geek wanted to broadcast wireless internet to the neighborhood and market it. Well… for a year now that antenna has been sitting in the “lab” doing nothing more useful than collecting dust. Why is such equipment so necessary? Well, because it is “cool,” of course! Why else would you need 17 computer cases, 2 boxes of mice, 2 boxes of keyboards, 37 “semi-functional” hard drives, 3 non-working wireless network cards, 20 phone modems, 17 motherboards, a broadcasting antenna, 295 feet of coax cable, 428 feet of Ethernet cable, 12 access points, 3 routers, a voice over IP box, oh… and a partridge in a pear tree?

So, in conclusion: loving a geek isn’t easy… but hey, I guess someone has to do it.
