Revision 9: October 2001
How appropriate that this month ended with Halloween, because rack work has been a nightmare. Everything that could go wrong, did go wrong.
It started with the 5U case. The case came in September, but had the wrong drive caddies in it - SCSI instead of IDE. That meant a two-week delay waiting for replacements. While I was waiting, I started working on the details of rebuilding the file server, starting with finding the right motherboard.
One thing that always bothered me about the old file server was the SCSI controller, an Adaptec 29160. This is a 64-bit PCI SCSI controller, which I had plugged into a 32-bit PCI slot. I wanted the controller in a 64-bit PCI slot, but there aren't many motherboards with 64-bit PCI.
In fact, all the motherboards that I could find that had 64-bit PCI were server boards, typically with dual processors and built in SCSI. But if it has built-in SCSI, then I don't need the 64-bit PCI! Finally, I found the SuperMicro P3TDLE - dual processors (which isn't necessary, but doesn't hurt), 64-bit PCI and NO onboard SCSI, so my controller was still useful. It was also substantially less expensive than most server motherboards.
It also had a built-in Intel 10/100 network connection, but no video - and no AGP slot! It's a server motherboard after all, who needs AGP? So along with the pair of 1GHz processors and 512MB of ECC SDRAM, I needed a PCI video card.
With the motherboard installed, I tried to figure out where all the drives would go in the system. The IDE caddies held the six 80GB IDE drives, the DVD on top of the drive caddies, and the floppy on the side... where would the SCSI hard drives go? There was space for one drive beside the floppy, but the specs said there was room for two hard drives. Stumped, I contacted the vendor, who admitted that the two-drive mounting was a separate part NOT included... so I got him to include it in the IDE caddy swap order.
When the IDE caddies finally arrived, they weren't black! Well, the caddies themselves were black, with purple handles (???), but the frame they plugged into was bare metal. Not wanting to involve the vendor (and the attendant delays), I took it upon myself to get a spray can of flat black Tremclad paint. A bit of disassembly, taping and stuffing and a move outside got the face of the frame painted black.
To make the six IDE drives work I ordered a Promise RAID controller, which turned out to be backordered, causing more delays. The controller was the last part to arrive. With it in place, I was ready to do the serious assembly work. The controller is a full length PCI card (don't see those all that often anymore), and each hard drive in the array has its own ATA/100 cable. The controller also has a 72-pin EDO RAM slot for 8-128MB of RAM. I immediately ordered the 128MB SIMM (what the heck).
After testing that the RAID controller booted up properly, it was time to get the SCSI gear out of the old file server and begin final assembly of the new file server. I disassembled the old file server to retrieve the SCSI controller, SCSI hard drives and SCSI DVD and put the new file server together.
And it didn't work.
There were so many problems with the system, it was hard to figure out where to even begin. It started with a conflict between the Adaptec SCSI controller and the Promise RAID controller. Even though the Promise controller uses IDE drives, it's recognized as a SCSI controller, so there's some debate within the machine as to which controller to boot from. In addition, I wanted to initially boot from the SCSI DVD drive, so that I could do a clean install of Windows 2000 Server. To boot from the SCSI DVD, you had to configure the Adaptec controller to boot from the DVD, in addition to convincing the motherboard that the Adaptec controller was the right controller to boot from. Suddenly you can understand why the Promise was pulled.
With the Promise controller out I was able to get installations started, and that's when the video card crapped out. Of course, I just explained all the problems in a paragraph or two, but the diagnostics took most of the day - especially when the video failure wouldn't occur until Windows 2000 was ready to boot into GUI mode.
Figuring the video card was defective, I tried to get a new one - but PCI video cards don't grow on trees anymore (everything is AGP), so it took another day to find a couple more PCI video cards. By this time, my network was starting to suffer from the lack of a file server. Remember that the file server in this case is also a domain controller (Kyle, the Exchange server, is also a domain controller) and it's the only DHCP server - and the IP leases expire in a day.
The good news is, cynic that I am, I had anticipated problems and not touched the boot drive of the old file server - I'd used the secondary drive as the boot drive for the new server. So I threw the old file server down beside the crippled new one and got it up again.
If you look closely at the picture above, you'll see that I went into serious testing mode. With the SCSI controller involved in the old file server, I hung an IDE hard drive and DVD off the IDE controller built into the motherboard, purely for testing purposes. Everything was out of the machine; the only card installed now was the PCI video card.
Three different PCI video cards ALL demonstrated the same sorts of problems - broken up video, hanging when resolution changes were made, and so on. Different brands were tried - ATI, S3, etc. With everything out of the machine, there was really only one conclusion to be made - the motherboard was defective.
All this testing had taken a couple of days, so my happiness level was pretty darn low. So when the manufacturer dragged their feet about replacing the motherboard, I got pretty angry pretty darn quickly. In hindsight, it probably wasn't too unreasonable to be doubtful - how could a motherboard screw up video? In the end, with a substantial amount of fuss (this is a pretty exotic motherboard, it had to be special ordered), the motherboard was replaced and the video problems disappeared. And ten days had passed since I first got all the parts together.
After a quick all-up test, everything seemed to work, so a careful, complete installation was done. Setting up Windows 2000 Server takes a while anyway, but formatting and RAIDing 400GB of drive space takes even longer. The successful install ran overnight.
Think I was done? The system worked, but I worried about it a bit - I had only put a 300 watt power supply in it, and at peak load (startup), my estimates came to nearly 400 watts. The power supply could take it (its peak load maximum was 450 watts), but it was a bit scary. I found a locally supplied 465 watt power supply (which peaks at 600) that has better cooling and the same form factor as the 300 watt. Also, while down in LA (great party Ken!) I visited Fry's and picked up a rounded single drive floppy cable (in fashionable blue).
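For anyone who wants to redo the power supply arithmetic, here is a rough sketch using only the wattage figures quoted above. The dictionary layout and the `headroom` helper are made up for illustration, and the 400 watt load is my own estimate, not a measurement.

```python
# Figures taken from the paragraph above; names are illustrative only.
ESTIMATED_PEAK_LOAD_W = 400  # "nearly 400 watts" at startup (an estimate)

SUPPLIES = {
    "300W (original)": {"rated": 300, "peak": 450},
    "465W (replacement)": {"rated": 465, "peak": 600},
}


def headroom(load_w, supply):
    """Return (rated margin, peak margin) in watts.

    A negative margin means the load exceeds that limit.
    """
    return supply["rated"] - load_w, supply["peak"] - load_w


for name, supply in SUPPLIES.items():
    rated_margin, peak_margin = headroom(ESTIMATED_PEAK_LOAD_W, supply)
    print(f"{name}: rated margin {rated_margin:+d} W, peak margin {peak_margin:+d} W")
```

The original supply only covers the startup load by dipping into its peak rating, while the replacement covers it within its continuous rating, which is the whole reason for the swap.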
The other issue with this system is heat. There are actually two thermal sensors that came with the case that plug into the display panel. They're configured to turn on a huge nine inch fan that mounts to the front of the case when the internal temperature hits 50° C (122° F)... I haven't heard this fan yet (I'm sure it'll be a hurricane), but it does mean it's not all that hot in there. It's the hard drive array that I worry about most - having six 80GB drives in such close proximity (a quarter inch apart) can't be good. I may have to find a way to stuff a fan in there somewhere.
Finally the big beast worked. And only six weeks in the making!
So with the new server up, the old file server could be gutted - the case going to the Linux server and the remaining parts of the file server going into the new 1U system.
The 1U machine went together pretty easily, with only one complication - the CPU fan. The stock (Intel provided) fan for a 1GHz processor is almost two inches high... and the case itself is only 1.75 inches! A fan specifically designed for 1U machines was required. With it everything fits in the case, but the processor does run pretty hot - around 45° C.
There is still more hardware work to be done, moving the Linux box into the 2U case that the file server once occupied, but the month was running down, and I had had enough of working on computer hardware problems. Besides, I'm not Linux savvy enough to know how to move it safely with hardware changes. Apparently it's easy, but I'm waiting for info from my local Linux guru before I do it.
So was the grief of October over? Nope, one more stab in the kidneys to go. Throughout the month, when tired of fighting with computer hardware, I turned my attention to the Cisco router. Via eBay and other sources, I acquired the parts to fully upgrade the Cisco to its optimal configuration - 32MB of Flash RAM, 64MB of DRAM. I also flashed it up to the latest version of the Cisco IOS. With books in hand I did lots of reading, and made some progress. But the one piece of functionality I couldn't get working was failover - being able to automatically swap from using the DSL connection to the cable connection and back again, depending on what service went down.
Actually, that's not entirely true. I did successfully test failover by giving the router two default routes - it would use whichever one came first (typically the DSL), but when the DSL interface went down, it would switch to cable, right?
What complicated the issue was NAT (Network Address Translation). NAT is what all these little LinkSys routers do, presenting a single "real" Internet IP address to the world for multiple inside IP addresses (192.168.x.x or 10.x.x.x, etc). Cisco definitely supports NAT; in fact, I was able to get NAT working just fine with DSL. It was NAT combined with failover that failed.
With two Internet connections plugged into the Cisco, there need to be two NAT pools. Both of these NAT pools are configured to work with any of the inside IP addresses of the network. The Cisco just uses the first one - again, typically DSL. When DSL goes down, the route over DSL dies, but the NAT pool does not. Any machine attempting to communicate over the Internet fails to get out because the NAT pool of the dead connection is still trying to handle it.
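To make the mismatch concrete, here is a toy model of the behavior described above. It is not Cisco IOS and does not reflect how the router is actually implemented; it only mimics what I observed, and the `Uplink` class and the function names are invented for this sketch. The route follows the interface state, while the NAT pool is always the first one configured.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    up: bool = True


def pick_route(uplinks):
    """Default route: the first uplink whose interface is up (this part worked)."""
    for link in uplinks:
        if link.up:
            return link.name
    return None


def pick_nat_pool_observed(uplinks):
    """NAT pool as observed: the first configured pool, regardless of link state."""
    return uplinks[0].name if uplinks else None


def can_reach_internet(uplinks):
    """Traffic only gets out if the chosen route and the chosen NAT pool agree."""
    route = pick_route(uplinks)
    return route is not None and route == pick_nat_pool_observed(uplinks)


dsl, cable = Uplink("dsl"), Uplink("cable")
links = [dsl, cable]
print(can_reach_internet(links))   # both links up

dsl.up = False                      # simulate the DSL outage
print(pick_route(links), pick_nat_pool_observed(links), can_reach_internet(links))
```

With DSL marked down, the route moves to cable but the pool stays on DSL, so the reachability check fails. Working failover would need the pool choice to follow the same interface state as the route.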
I actually ended up hiring a Cisco professional to spend a few hours trying to make this work, and he struck out as well, so I can't blame this failure on my inexperience with Cisco. This failure likely explains why I could never find any reference to a similar configuration on the Cisco site - I found non-NAT multi-homed Internet setups, multi-homed NAT with known networks (as opposed to the Internet, which is an unknown network), just not my exact setup. And Cisco wouldn't put up a bulletin that says "See this configuration? Can't do it."
I see it as a bug in the Cisco IOS, but that doesn't help the fact that it doesn't work. The price you pay for being on the bleeding edge is being wrong, once in a while. I just wish I wasn't wrong here - it was such a cool solution, and it cost a bundle, too.
However, two days before I hit the wall with the Cisco, I ran across an alternative solution to the problem - Nexland's ISB Pro800 turbo. If I had found this product three months earlier I likely would never have even attempted the Cisco solution. The Nexland is a NAT router (like the LinkSys models we know and love), with TWO WAN PORTS. It has a variety of settings for failover, load balancing, and so on. This is a product literally built to do what I'm looking for, so I ordered one... we'll see how it works.
Issues with this revision: