Here at InfoQ. Dan made a number of interesting points that resonate with our experience:
The need to figure out dependencies before crisis hits
The main point here is that good software design heavily promotes resource abstraction: databases become simple data sources that can be used without worrying about their whereabouts. That makes dependency mapping an intricate task if it is not conducted in parallel with the development process. We have all sifted through lines and lines of J2EE thread dumps (to cite an extreme) in a moment of crisis, only to end up looking at 5 different configuration files to pinpoint exactly which resource is failing. That requires someone who is familiar with application server internals, and that someone usually comes from the application development team rather than from system administration.
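As a rough illustration of what a dependency map maintained alongside development could look like, here is a minimal sketch. Everything in it is an assumption made for the example: the JNDI-style resource names, the host names, the application names and the `apps_affected_by` helper are hypothetical and do not come from the talk.

```python
from typing import List

# Hypothetical inventory: logical resource name -> physical endpoint.
DATA_SOURCES = {
    "jdbc/OrdersDS": "db-orders-01.example.internal:1521",
    "jdbc/CatalogDS": "db-catalog-01.example.internal:1521",
    "jms/BillingQueue": "mq-01.example.internal:7676",
}

# Hypothetical inventory: application -> logical resources it uses.
APP_RESOURCES = {
    "checkout": ["jdbc/OrdersDS", "jms/BillingQueue"],
    "search": ["jdbc/CatalogDS"],
    "reporting": ["jdbc/OrdersDS", "jdbc/CatalogDS"],
}


def apps_affected_by(host: str) -> List[str]:
    """Return the applications that use any resource hosted on `host`."""
    # Logical resources whose endpoint lives on the failing host.
    failing = {
        name
        for name, endpoint in DATA_SOURCES.items()
        if endpoint.split(":")[0] == host
    }
    # Applications that reference at least one of those resources.
    return sorted(
        app
        for app, resources in APP_RESOURCES.items()
        if failing.intersection(resources)
    )


print(apps_affected_by("db-orders-01.example.internal"))
```

With a map like this kept current, the question "which applications break if this host goes down?" becomes a lookup rather than a hunt through thread dumps and configuration files.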
Power crunch
Software does not get more efficient quarter after quarter, which renders the efficiency efforts made on the hardware side negligible.
An active/active disaster recovery setup is often preferable to active/passive
The initial set-up cost in complexity and capital expenses is often offset by two critical factors:
- active/active means that both sites are continuously tested, as opposed to a passive site which, as you discover at the worst possible moment, cannot serve traffic without configuration changes
- active/passive failover is typically disruptive (lost sessions, etc.) so the business impact is higher
Another nicety of an active/active set-up is that, as you multiply the number of sites, each one requires less excess capacity to absorb the loss of one site. Set up 3 sites and each needs to have 150% of its nominal capacity (its normal share of the load) to absorb the loss of one; 4 sites means 133%, etc.
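The arithmetic behind those figures is simple: with N sites sharing the load evenly, each one normally carries 1/N of it, and after losing one site each survivor carries 1/(N-1), so the ratio is N/(N-1). A minimal sketch follows; the function name and the `failures` parameter are illustrative assumptions, and the model assumes the load is spread evenly across sites.

```python
def required_capacity_per_site(sites: int, failures: int = 1) -> float:
    """Capacity each site needs, as a multiple of its nominal share of the
    load, so that the remaining sites can absorb `failures` lost sites.

    Assumes the load is spread evenly across all sites.
    """
    if failures < 0 or sites <= failures:
        raise ValueError("need more sites than tolerated failures")
    return sites / (sites - failures)


for n in (2, 3, 4, 5):
    ratio = required_capacity_per_site(n)
    print(f"{n} sites: {ratio:.0%} of nominal capacity per site")
```

Two sites is the expensive case: each must be sized for 200% of its nominal load, which is the same hardware bill as active/passive.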
Developers have 2 groups of customers, real customers and the ops team
That is the gist of the "Release It!" book mentioned in an earlier post.
Some figures about eBay's current infrastructure
They use 5000 application servers running on commodity hardware and about 300 database servers running on mid-range Sun hardware.