CMG'09: "How do you analyze 100,000s of servers?"

Charles Loboz (Microsoft)
  • No homogeneous software/hardware/applications
  • Access is often limited (e.g. Hotmail servers are off-limits)
  • In the old days, 1 server analyzed per day
  • Stopped using averages and stddev (because data are not normal)
  • Built 10-bin histograms for utilization
  • Even that is limited, because the long tails are what trigger issues (e.g. a few bad queries generate load, then all the other queries pile up behind them)
  • No one cares about utilization (except data geeks), only performance matters
  • Estimate the impact of utilization on performance with a "Performance Impact Factor" (PIF): a weighted average of the histogram bins, with the high-utilization bins weighted more heavily so that long tails stand out. Compute one PIF each for CPU, network and IO
Recipe
  • Compute histograms
  • Compute PIFs for each server
  • Cross-tabulate PIFs to server names to tag servers as underused, overloaded, etc.
  • Store everything in a database
Pitfalls
  • PIF averages don't mean anything
  • A PIF is good at spotting a "dead-cold" server, but it cannot tell you that you have an issue, only that you have to investigate
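
The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quadratic bin weights, the two tagging thresholds and the server names are made up for the example, since the talk notes do not record the values actually used.

```python
# Assumed weights, one per 10%-wide utilization bin. They grow quadratically so
# that time spent in the busiest bins dominates the score; the notes do not
# give the weights actually used.
WEIGHTS = [(i + 1) ** 2 for i in range(10)]


def histogram(samples, bins=10):
    """Fraction of utilization samples (0-100%) falling in each bin."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = [0] * bins
    for s in samples:
        # Clamp so that 100% lands in the last bin and bad readings stay in range.
        idx = max(0, min(int(s * bins / 100), bins - 1))
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def pif(hist, weights=WEIGHTS):
    """Weighted average of the histogram, scaled to the 0..1 range."""
    return sum(h * w for h, w in zip(hist, weights)) / max(weights)


def tag(score, cold=0.05, hot=0.5):
    """Map a PIF to a label; both thresholds are illustrative assumptions."""
    if score < cold:
        return "underused"
    if score >= hot:
        return "investigate"
    return "ok"


def tabulate(servers):
    """servers: {name: {resource: [samples]}} -> {name: {resource: (pif, tag)}}."""
    table = {}
    for name, resources in servers.items():
        table[name] = {}
        for resource, samples in resources.items():
            score = pif(histogram(samples))
            table[name][resource] = (round(score, 3), tag(score))
    return table


if __name__ == "__main__":
    # Hypothetical servers and samples, for illustration only.
    demo = {
        "web01": {"cpu": [5] * 90 + [95] * 10},  # mostly idle, busy 10% of the time
        "web02": {"cpu": [3] * 100},             # dead cold
    }
    for name, row in tabulate(demo).items():
        print(name, row)
```

The hot label is deliberately "investigate" rather than "overloaded", in line with the second pitfall: a high PIF is a pointer to look closer, not a diagnosis.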

Velocity: John Allspaw @flickr, Capacity Planning

What can cause downtime:
  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems
Deployment and management tricks from the HPC world: ganglia, System Imager.
Gather metrics of course, and build models, ideally out of live data rather than artificial benchmarks. fityk can be used instead of Excel to do curve fitting. [My guess is that R would work great for that too]
Some flickr stats: 12,629 nagios checks, 1314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day. [So flickr uses nagios + ganglia]
One key trick is to build kill switches into all the features so that they can be turned off when load increases.
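
A kill switch can be as simple as a flag checked before the feature runs. The sketch below is a minimal in-memory version, not how flickr implements it: the class, the feature names and the load thresholds are all hypothetical, and a real deployment would presumably keep the flags in shared configuration rather than in process memory.

```python
class KillSwitches:
    """In-memory registry of features that can be turned off under load."""

    def __init__(self, features):
        self._enabled = {name: True for name in features}

    def disable(self, name):
        self._enabled[name] = False

    def enable(self, name):
        self._enabled[name] = True

    def is_enabled(self, name):
        # Unknown features are treated as off, the safe default under load.
        return self._enabled.get(name, False)


def apply_load(switches, load, thresholds):
    """Turn off every feature whose threshold is at or below the current load.

    load and thresholds are fractions of capacity (0.0 to 1.0); the values
    used below are illustrative assumptions.
    """
    for name, limit in thresholds.items():
        if load >= limit:
            switches.disable(name)
        else:
            switches.enable(name)


if __name__ == "__main__":
    # Hypothetical optional features, cheapest to lose first.
    thresholds = {"related_photos": 0.7, "search_suggestions": 0.9}
    switches = KillSwitches(thresholds)
    apply_load(switches, 0.8, thresholds)
    for feature in thresholds:
        print(feature, "on" if switches.is_enabled(feature) else "off")
```

Call sites then guard the optional work with `if switches.is_enabled("related_photos"):` so that the core page still renders when the extra is switched off.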

On continuous production

Continuous production is an idea that's probably as old as the first modern blast furnaces, an idea that decades of industrialization have perfected, an idea that has become all the more current in internet operations now that the mom-and-pop web hosting has turned into industrial scale operations. The body of literature regarding the topic of running an online service continuously without eating away profit margins is getting richer by the day, with a crucial contribution from James Hamilton (pdf of LISA paper) summarizing a wealth of traits of online operations that manage to scale from a technical and, just as importantly, from an economic standpoint.
The peculiar nature of software is such that there are actually two distinct types of production:
  1. Production of service/software from need to formal requirements to design, development and release; the end product is a set of components/artifacts that are finished enough to fulfill users' needs (a.k.a. software development)
  2. Production of service/software from these artifacts to an actual service (a.k.a. service production)
Software engineers and their managers typically care about the former while the latter has traditionally been the realm of system engineers and operations. These 2 crowds have quite different cultures, nigh antagonistic since developers are meant to create manageable change while operations is tasked with keeping the whole thing running around the clock. And any experienced person in operations will tell you that, in the vast majority of cases, things break when they change, when they are brought out of steady state. The aforementioned paper offers a way to bridge that gap by bringing developers much closer to operations. That's Amazon's supposed motto: "You build it, you run it". This is a cultural change that has proven natural in a tiny structure where everyone does everything but is proving harder to keep as the company grows, especially when it has not been constantly pushed as the correct way of organizing work. It has been one of my ongoing projects to make sure that we close that precious feedback loop between developers and operations.

Great presentation from Dan Pritchett at eBay on operational manageability

Here at InfoQ. Dan made a number of interesting points that resonate with our experience:

The need to figure out dependencies before crisis hits

The main point here is that good software design heavily promotes resource abstraction. Databases become simple data sources that can be used without worrying about their whereabouts. So this makes dependency mapping an intricate task if it is not conducted in parallel to the development process. We have all sifted through lines and lines of J2EE thread dumps (to cite an extreme) in a moment of crisis to end up looking at 5 different configuration files and pinpoint exactly which resource is failing. Right there that requires someone who is familiar with application server internals and that someone is usually someone from the application development team rather than a system administrator.

Power crunch

Software does not get more efficient quarter after quarter, which makes the efficiency gains on the hardware side negligible.

An active/active disaster recovery setup is often preferable to active/passive

The initial set-up cost in complexity and capital expenses is often offset by two critical factors:
  1. active/active means that both sites are continuously tested, as opposed to a passive site which, as you discover at the worst possible moment, won't be able to serve without configuration changes
  2. active/passive failover is typically disruptive (lost sessions, etc.) so the business impact is higher
Another nicety of an active/active set-up is that, as you multiply the number of sites, each one requires less excess capacity to sustain the loss of 1 site. Set up 3 sites and each needs to have 150% of nominal capacity to absorb the loss of one; 4 sites means 133%, etc.
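
The arithmetic behind those figures: with N equally loaded sites, the survivors of a single-site loss must carry the full load between N - 1 of them, so each site needs N / (N - 1) of its nominal share. A short sketch, where the function name and the `failures` parameter are my own additions for illustration:

```python
def per_site_capacity(sites, failures=1):
    """Capacity each site needs, as a multiple of its nominal share,
    so that the remaining sites can absorb the loss of `failures` sites.

    Assumes load is spread evenly across sites before and after the loss.
    """
    if sites <= failures:
        raise ValueError("need more sites than tolerated failures")
    return sites / (sites - failures)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} sites: {per_site_capacity(n):.0%} of nominal capacity each")
```

Two sites is the degenerate case where each must be able to carry everything (200%), which is the same capital outlay as active/passive.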

Developers have 2 groups of customers, real customers and the ops team

That is the gist of the "Release It!" book mentioned in an earlier post.

Some figures about eBay's current infrastructure

They use 5000 application servers running on commodity hardware and about 300 database servers running on mid-range Sun hardware.

Release It! is a great book

At long last a book about production-ready software aimed at developers. More often than not (except in the small web 2.0 startups) people who write software do not operate it and vice-versa. Hence the prevailing mentality among developers that tested means done. "Release It!" aims at correcting this misconception and it does a great job at it.