CMG'09: Solaris/Linux Performance Measurement and Tuning (part 2)

Adrian Cockcroft (Netflix)
My notes:
  • Netflix releases every 2 weeks, first to beta, and tracks everything
  • Everything at Netflix (or in web-land in general) is instrumented, in libraries, so that instrumentation comes for free
  • Beware of kernel tweaks: they were good for older kernels, but kernels are now a lot more auto-tuned
  • On Solaris, microstate data very useful
  • With Poisson arrivals, steady state and N identical servers, response time can be approximated as R = S / (1 - U^N), where S = service time and U = utilization (throughput * S, per server)
  • Issues with this simplistic model: traffic is bursty, service time varies, the N servers don't all process the same thing, and virtual hardware makes it a lot harder to figure out
  • Measurement errors (especially around measuring time)
  • So don't bother with utilization
  • Load average on Linux is broken: it includes tasks blocked on disk I/O, not just runnable ones
  • I/O wait is fundamentally broken: the CPU never waits for I/O per se
  • Cockcroft Headroom Plots: 99th-%ile against response time
  • On Linux, the best way to track I/O per process is with SystemTap
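
To get a feel for the response-time approximation in the notes above, here is a minimal Python sketch. It uses the form R = S / (1 - U^N), with U taken as the per-server utilization (per-server throughput * S). The function names and the sample numbers are mine, not from the talk, and the model carries all the caveats listed above (bursty traffic, variable service time, and so on).

```python
def utilization(throughput_per_server: float, service_time: float) -> float:
    """Utilization law: U = X * S, with X the throughput seen by one server."""
    return throughput_per_server * service_time


def response_time(service_time: float, util: float, servers: int = 1) -> float:
    """Approximate mean response time R = S / (1 - U^N).

    Assumes Poisson arrivals, steady state and N identical servers.
    """
    if not 0.0 <= util < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    if servers < 1:
        raise ValueError("servers must be >= 1")
    return service_time / (1.0 - util ** servers)


if __name__ == "__main__":
    s = 0.010  # hypothetical 10 ms service time
    for n in (1, 2, 4):
        for x in (50.0, 80.0, 95.0):  # hypothetical requests/s per server
            u = utilization(x, s)
            r = response_time(s, u, n)
            print(f"N={n} U={u:.2f} R={r * 1000:.1f} ms")
```

Running it shows the familiar knee: response time stays close to S at low utilization and climbs steeply as U approaches 1, with the knee moving to the right as N grows.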

CMG'09: "How do you analyze 100,000s of servers?"

Charles Loboz (Microsoft)
  • No homogeneous software/hardware/applications
  • Access is often limited (e.g. Hotmail servers are off-limits)
  • In the old days, 1 server analyzed per day
  • Stopped using averages and stddev (because the data are not normally distributed)
  • Built 10-bin histograms for utilization
  • Even that is limited, because the long tails are what trigger issues (e.g. one bad query drives up load, then all the other queries pile up behind it)
  • No one cares about utilization (except data geeks), only performance matters
  • Estimate the impact of utilization on performance with a "Performance Impact Factor" (PIF): a weighted average over the histogram bins, with the high-utilization bins weighted more heavily so that long tails stand out; computed separately for CPU, network and I/O
Recipe
  • Compute histograms
  • Compute PIFs for each server
  • Cross-tabulate PIFs to server names to tag servers as underused, overloaded, etc.
  • Store everything in a database
Pitfalls
  • PIF averages don't mean anything
  • A PIF is good at spotting a "dead-cold" server, but it cannot tell you that you have an issue, only that you have to investigate
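
The recipe above can be sketched in a few lines of Python. This is only my reading of the idea, not the speaker's implementation: the talk did not give the weights, so the power-law weighting (`exponent`), the `cold`/`hot` thresholds, the tag names and the sample fleet below are all made-up assumptions. The database step is left out.

```python
from typing import Dict, List, Sequence


def histogram(samples: Sequence[float], bins: int = 10) -> List[float]:
    """Fraction of samples in each equal-width bin over 0..100% utilization."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = [0] * bins
    width = 100.0 / bins
    for value in samples:
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"utilization out of range: {value}")
        # A sample of exactly 100% goes into the last bin.
        counts[min(int(value / width), bins - 1)] += 1
    return [c / len(samples) for c in counts]


def pif(hist: Sequence[float], exponent: float = 6.0) -> float:
    """Weighted average of the histogram.

    Assumption: bin i gets weight ((i + 1) / bins) ** exponent, so the
    high-utilization bins dominate and a long tail stands out.
    """
    bins = len(hist)
    weights = [((i + 1) / bins) ** exponent for i in range(bins)]
    return sum(w * h for w, h in zip(weights, hist))


def tag(score: float, cold: float = 0.001, hot: float = 0.05) -> str:
    """Map a PIF to a label; thresholds are hypothetical, tune per fleet."""
    if score < cold:
        return "underused"
    if score >= hot:
        return "overloaded"
    return "normal"


def cross_tabulate(
    fleet: Dict[str, Dict[str, Sequence[float]]]
) -> Dict[str, Dict[str, str]]:
    """Server name -> resource (cpu, net, io) -> tag."""
    return {
        server: {res: tag(pif(histogram(s))) for res, s in resources.items()}
        for server, resources in fleet.items()
    }


if __name__ == "__main__":
    fleet = {  # made-up utilization samples, in percent
        "web01": {"cpu": [5.0] * 100, "io": [45.0] * 100},
        "db01": {"cpu": [20.0] * 90 + [95.0] * 10, "io": [30.0] * 100},
    }
    for server, tags in cross_tabulate(fleet).items():
        print(server, tags)
```

Note how db01's CPU averages under 30% utilization yet still gets flagged, because a tenth of its samples sit in the top bin. That is the long-tail effect the averages were hiding. In line with the pitfalls above, treat a tag as a prompt to investigate, not as a diagnosis.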

On continuous production

Continuous production is an idea that's probably as old as the first modern blast furnaces, an idea that decades of industrialization have perfected, and an idea that has become all the more current in internet operations now that mom-and-pop web hosting has turned into industrial-scale operations. The body of literature on running an online service continuously without eating away at profit margins is getting richer by the day, with a crucial contribution from James Hamilton (pdf of LISA paper) summarizing a wealth of traits of online operations that manage to scale from a technical and, just as importantly, from an economic standpoint. The peculiar nature of software is such that there are actually two distinct types of production:
  1. Production of service/software from need to formal requirements to design, development and release; the end product is a set of components/artifacts that are finished enough to fulfill users' needs (a.k.a. software development)
  2. Production of service/software from these artifacts to an actual service (a.k.a. service production)
Software engineers and their managers typically care about the former, while the latter has traditionally been the realm of system engineers and operations. These two crowds have quite different, nigh antagonistic, cultures, since developers are meant to create manageable change while operations is tasked with keeping the whole thing running around the clock. And any experienced person in operations will tell you that the vast majority of things break when they change, when they are brought out of steady state. The aforementioned paper offers a way to bridge that gap by bringing developers much closer to operations. That's Amazon's supposed motto: "You build it, you run it". This is a cultural change that comes naturally in a tiny structure where everyone does everything, but it is proving harder to keep as the company grows, especially when it has not been constantly pushed as the correct way of organizing work. It has been one of my ongoing projects to make sure that we close that precious feedback loop between developers and operations.

"If you see something, say something..."

My team is in the process of evaluating a number of monitoring packages for our development and production environments. So far we have mostly used gum and duct tape ^H^H^H a combination of mon, monit, zenoss and PRTG to get alerted when something goes wrong, and we are clearly at a point where this type of contraption cannot really scale anymore. It was good while it lasted, but it has now passed its optimal benefit/cost ratio. I have had some experience with HP OpenView in the past, and the prospect of implementing a beast like that does not really appeal to me. Paying top dollar and top man-hours to get a big "enterprise" piece of machinery to survey 300 hosts on 2 sites really seems overkill to me. Other solutions that we are entertaining are:
Since we operate in a largely Unix environment but also have a growing number of Microsoft boxes (and VMware ESX hosts), having one package that is reasonably good at both monitoring and alerting is desirable. Configuration management would be a nice plus but is not a strict requirement. I am also thinking of throwing Splunk into the mix as a forensic tool to aid in incident and problem management. I will report on our progress as we evaluate these tools.