Well, it was a great project: fun, challenging, and interesting. I was slightly in over my head and learned a great deal. I watched it grow, and its requirements grow, and actually its budget grow, and the marketing folks told us that on day one so many people would show up that the continent was likely to become unbalanced and tip over. That was sort of a bad foreshadowing.... And so, like so many other projects (software has something like a 50% failure rate), it imploded and stopped on account of war, or so they said. We withdrew with honor.
Some details that probably won't help you on your current project but seemed interesting at the time. We were going to build two completely redundant sites hooked up with many redundant T1s and VPNs -- the T1s were expected to have the lowest latency, but the VPNs were expected to be the most reliable, as they are routable and supposedly self-healing.
IBM has a product like this for their Z servers, but it only works over about a 40-mile distance -- uh, that's good; they literally sell it as letting you have your main system in Manhattan and a backup in New Jersey. But that was too close for our needs (an earthquake, hurricane, or war could still reach both sites). So we had a client requirement to separate the sites by at least a thousand miles and an international border.
Over the course of about six months, while splashing their names all over the papers, WorldCom and Qwest dropped their T1 prices at least 70%. Soon they were so low I just wanted to buy them by the case and hand them out as office prizes. T1s stretching across an ocean for a couple hundred dollars a month.
We had tons of servers to offload and partition the problem. For instance, we had two redundant machines whose only job was to be the syslog servers for everyone -- we felt that would make it harder for any potential hacker to eliminate his tracks, and the cost for that is basically zero. But the system was actually fairly simple and not all that expensive until we started having SANs and the like tossed at us to get the disk performance we thought we would need to satisfy the marketing projections.
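The dedicated-syslog-host idea is cheap to reproduce. Here's a minimal Python sketch of the client side, assuming two hypothetical log machines -- the addresses and ports below are made up for the sketch (the real boxes would listen on UDP 514):

```python
import logging
import logging.handlers

# Hypothetical addresses for the two redundant syslog machines.
# In a real deployment these point at the dedicated log hosts.
SYSLOG_HOSTS = [("127.0.0.1", 5140), ("127.0.0.1", 5141)]

def make_logger(name="app"):
    """Attach one UDP syslog handler per log host, so every record
    leaves the application server immediately and lands on both."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    for address in SYSLOG_HOSTS:
        logger.addHandler(logging.handlers.SysLogHandler(address=address))
    return logger
```

A hacker who owns the app server can scrub its local logs, but the copies already shipped to two separate machines are out of his reach -- which is the whole point, and why the marginal cost is basically zero.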
Ignore all the other servers and HSRP routers in the picture. We had two completely redundant sites, each with two database servers: one designated primary, the others receiving constant updates and keeping everything in sync. Since our entire database could be kept in memory at one time, we eventually decided to throw away the disk drives. Our model fit into memory, we didn't feel we would have a problem with locks, and we replicated the requests to the hot-standby machines. It turned out that more than one productized real-time database engine does the same thing: run it without ACID on disk and it ticks over at 50K transactions a second on good but not great hardware; run it with ACID and disks and it comes down to a few hundred transactions a second.
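The primary/hot-standby scheme can be sketched in a few lines. This is a toy, not the project's code -- the names are invented, and a real engine ships a replication log over the network rather than calling methods directly:

```python
# Toy version of the scheme: the whole dataset lives in a dict (our model
# fit in memory), and the primary forwards every write to its hot standbys
# before acknowledging.

class Node:
    """One database server: all state in memory, no disks."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary(Node):
    """The designated primary: applies a write, then replicates it."""
    def __init__(self, standbys):
        super().__init__()
        self.standbys = standbys

    def write(self, key, value):
        self.apply(key, value)
        for standby in self.standbys:   # the local buddy and the remote site
            standby.apply(key, value)
        return "ack"                    # ack only once the standbys are current
```

On failover, any standby already holds the full dataset and can be promoted by pointing traffic at it -- no disk recovery, which is where that 50K-versus-a-few-hundred transactions-per-second gap comes from.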
It was a real choice project, and when it was over, I'd never want another (well, not for a couple of weeks).
So we tossed out the SAN and then the RAIDs, and basically we were planning on building a specialized application engine that kept in sync locally with its buddy and kept in sync remotely over the VPNs and T1s. By paying enough for high-availability hardware, we felt our replication streams would do the job for us and give us the five nines we sought. And sticking on our business hats, we realized that, well, in our domain, if worst came to worst during a failover and someone's transaction was lost (and we didn't see how that could be), that was actually a cost that occurred infrequently enough to be covered in customer support's budget. I don't think that's a total cop-out; I actually think that, given the right domain, it's a reasonable line item, just as shrinkage is.
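The back-of-the-envelope version of that line-item argument looks like this -- every figure here is made up; the point is the shape of the reasoning, not the values:

```python
# Hypothetical numbers: assumed failover rate, assumed loss per failover,
# and an assumed cost for support to make a customer whole.
failovers_per_year = 2
txns_lost_per_failover = 1      # we didn't see how even this could happen
make_good_cost = 50.0           # dollars of support credit per lost transaction

annual_exposure = failovers_per_year * txns_lost_per_failover * make_good_cost
print(annual_exposure)  # prints 100.0 -- a budget line item, like shrinkage
```

Against the cost of fully durable distributed commits stretched across a thousand miles, that's a budgeting decision, not an engineering one.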
Now our lead engineer, he was brilliant and outstanding in every way
and he was a good man too. Humanitarian man, man of wit, of humor. We worshiped the man, like a god, and followed every order however ridiculous.
October 9th, 0430 hours, sector 13420, head 15:
I watched our server crawl along the edge of a straight razor. That's my
dream. That's my nightmare. Crawling, slithering, along the edge of a
straight razor, and surviving.
So I'd like to say this worked and the continent did tip over, but what I think happened is that someone realized that our methods, marketing and engineering, had become unsound and very obviously insane.
They, we, were out there operating without any decent restraint.
Totally beyond the pale of any acceptable engineering and business conduct.
And he was still in the field commanding his troops.
And so the project was terminated.
"PBR Street Gang, this is Almighty, over...
"This is Almighty, standing by, over.
"This is Almighty, how do you copy, over..."
"The horror. The horror..."