I've been slowly working my way through the server rack, upgrading all of my servers. Some of the machines are as much as five years old, and all spinning gear (CPU fans, case fans, hard drives) are essentially ticking time bombs. In addition there is new hardware to be added to the rack, which means virtually everything in the rack has to move... the new configuration with eight servers completely fills the 30U rack.
What makes this especially challenging is that they ARE servers... they're constantly in use. I can take them down for a few minutes, but after a half hour the phone starts to ring. However, some servers are more sensitive to this than others - and Cartman is one of the least sensitive, since it's largely an internal-only server.
Cartman has a variety of tasks. Primarily he's a file server, but also a domain controller (one of two), DHCP and DNS server. As a file server, he has a 400GB RAID array... doesn't sound like much, but I built it in October of 2001. It's done with a Promise SX6000 controller and six 80GB hard drives. At the time, it was a monster. Since it's essentially been on since it was first built, those drives have over 30,000 hours of spin time... very scary.
Before tearing Cartman apart I used Acronis True Image to image the boot drives, and I backed the entire 400GB drive array up on a single external USB 400GB drive. And yes, I used xcopy with verify and double checked everything before I tore it down.
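For anyone who wants to do their own double checking after a big copy like this, here's a minimal Python sketch of the idea: walk the source tree and compare a SHA-256 hash of each file against its copy on the backup drive. This is not what xcopy does internally and not the exact check I ran; the function names `file_digest` and `compare_trees` are made up for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(source: Path, backup: Path) -> list[str]:
    """List every file under source that is missing or different in backup."""
    problems = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = backup / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src) != file_digest(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

Calling `compare_trees(Path("E:/"), Path("F:/"))` (drive letters hypothetical) returns an empty list when every source file has an identical copy on the backup. It only checks one direction, so extra files on the backup side are ignored.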
This is what I saw after hauling Cartman out of the rack and popping the cover. Essentially identical to what I saw in October 2001 - one crammed case. You can see the six ATA/100 ribbon cables coming out of the Promise controller running to the two three-drive caddies holding the 80GB drives. In the middle are the two 17GB SCSI drives that are used as boot drives, which, along with the SCSI DVD drive, are run from the Adaptec 29160 SCSI controller. Oh, and an Exabyte external tape drive plugs in there too.
Disassembly of this beast starts with the metal bar running across the case that also supports the two SCSI hard drives (and a fan). Then the entire front drive array holding the DVD, floppy and two drive caddies was removed. Both the SCSI and RAID controllers were pulled as well, leaving the case pretty darn bare. With everything out I powered up the machine just to take a look and noticed that one of the CPU fans was barely spinning any more. I had planned on replacing them anyway; this was just extra incentive.
However, the motherboard is so busy that the fancy new Socket 370 cooling blocks I bought wouldn't even fit in the space! But I was able to use the old blocks by replacing the worn-out fans with the fans from the new blocks.
After a thorough cleaning, I installed a gigabit network card and began the rest of the reassembly. I'm retiring the Promise controller altogether, going to a SATA array using six Hitachi Deskstar 7K400 drives. Yep, that's right... from a 400GB array to 400GB drives, for a total of two terabytes! And to drive this puppy, I'd need a SATA controller, so I went back to Adaptec for their 2810SA controller.
It actually supports eight drives, but I only had space for six. You can see the controller card and new caddies to hold the drives. SATA cables are much tidier than ATA cables, so I got a bunch of space back in the case.
Here you can see the Chenbro caddies with three SATA cables apiece. There's one power plug for all three drives (which is very nice) and it also has a heavy blower fan pumping directly onto the drives.
The old 17GB Atlas V drives are replaced with shiny new 147GB Atlas 10Ks. More disk space!
With everything crammed back in the case, it was time to get things set up. Even before I started the install of Windows 2003 server I wanted to get the array set up. What was interesting is that every card installed in the machine had a boot BIOS in it - the SCSI controller, the RAID controller AND the gigabit network card! Getting the BIOS set up to boot from the right device took some fiddling.
Then I decided to start the array configuration from the BIOS, so I set up a RAID 5 array. Being a diligent geek, I went to the Adaptec web site to check for latest drivers, BIOS updates, and so on. Adaptec had updates for both the 2810SA and the 29160, so I updated both BIOSes. What's stunningly annoying is that you HAVE to install BIOS updates from a floppy. The software is hard coded to read from drive A and nowhere else. Presumably I could set up a USB drive to do this, but this old SuperMicro motherboard ain't that smart.
I was glad I'd checked all this in advance: all over the readme files for the firmware were warnings that doing these upgrades would destroy the existing arrays, and you'd need to back everything up. Since I had nothing on the drives, I had nothing to fear.
Feeling smug with all my firmware flashed, I headed off into the BIOS set up for the 2810SA to get my spiffy new drive array configured. Apparently I did it wrong because I selected “Clean” to start the array rather than “Build/Verify.”
But I didn't know this at the time - off it went, ticking away to itself. I thought it might take a long time to set up a two terabyte array, but it was done in about 15 minutes... well, almost done. It got to 99% and then said “Controller Kernel Stopped Running!” And then the machine would reboot. That didn't seem good.
Every time I restarted the machine and went back into the 2810SA BIOS, I'd get the same error and reboot the machine.
In an effort to be positive about my situation, I ignored the failure and moved on - set up Windows 2003 Server. Once it was up and running, I tried to install the drivers for the controller card, but Windows wouldn't recognize the card. That can't be good either. I filed a tech support request with Adaptec, but wouldn't hear back for 48 hours; by then I had solved it on my own.
I went to bed late, very grumpy. The next morning I woke up thinking maybe the firmware update was a mistake. So I reverted - got the old firmware, set up new floppies and attempted to install it. But it kept failing with the same error. Couldn't revert.
Then, in a flash of insight, I realized what was happening to the controller - it was crashing! And right at the point of completing the array. After it rebooted, the controller would restart, see the array almost finished configuring and attempt to finish it... crashing the controller again! So, how to stop the array from rebuilding? Pull all the hard drives out! That'll slow the bugger down.
Sure enough, as soon as I pulled the drives, I was able to revert the firmware. Why I still reverted the firmware, I'm not sure - I guess I had a course in mind and thinking wasn't going to divert it. With the firmware reverted, the array had died, so when I plugged the drives back in, nothing bad happened.
Now afraid of the BIOS configuration stuff, I booted back into Windows, and reverted the driver as well to match the firmware. If you've never done this, you're a happier person than me: reverting to an older driver is a bugger. Windows 2003 Server has a rollback driver option, but it doesn't work if you haven't previously installed the older driver. So I had to do this the hard way - uninstall the driver and then carefully locate all the backup copies of the DLLs and kill them by hand. Once I had it all, installing the old driver worked, AND it came up just fine.
Now I was able to set up the RAID 5 array from Adaptec's client for Windows, which was a whole bunch clearer about the right ways to do things. And that's when I discovered that correctly building a two terabyte array takes an entire day.
The next day I discovered that my two terabyte array is actually a 1.8TB array. And that Windows understands TB: it displays that way in Windows Explorer. Funny, huh? I wonder if they have PB (as in petabyte, a thousand terabytes) in there as well.
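The missing 0.2TB isn't really missing. As a rough sketch of the arithmetic (assuming each drive holds exactly 400 billion bytes and ignoring any controller or file system overhead, neither of which I measured): RAID 5 gives up one drive's worth of space to parity, and Windows counts a "TB" as 2^40 bytes rather than 10^12. The helper name `raid5_usable_bytes` is made up for this example.

```python
def raid5_usable_bytes(drive_count: int, drive_bytes: int) -> int:
    """RAID 5 spends one drive's worth of space on parity: usable = (n - 1) drives."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_bytes


# Assumption: a "400GB" drive is exactly 400 * 10**9 bytes (decimal, as marketed).
DRIVE_BYTES = 400 * 10**9

usable = raid5_usable_bytes(6, DRIVE_BYTES)

decimal_tb = usable / 10**12  # the drive maker's terabytes
binary_tb = usable / 2**40    # the binary terabytes that Windows labels "TB"

print(f"{decimal_tb:.2f} TB decimal, {binary_tb:.2f} TB binary")
# prints "2.00 TB decimal, 1.82 TB binary"
```

The same formula accounts for the old array: six 80GB drives in RAID 5 leave five drives' worth of space, or 400GB.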
The rest of the set up was uneventful, really... things got loaded back on, DHCP and DNS configured, and so on. The next level of excitement would come with the most dangerous update of all... converting an Exchange 2000 server to 2003!