CMG'09: Solaris/Linux Performance Measurement and Tuning (part 2)

Adrian Cockcroft (Netflix). My notes:
  • Netflix releases every 2 weeks, first in beta and tracks everything
  • Everything at Netflix (or in web-land in general) is instrumented, in libraries, so that instrumentation comes for free
  • Beware of kernel tweaks: they were good for older kernels, but kernels are now a lot more auto-tuned
  • On Solaris, microstate data very useful
  • With Poisson arrivals, steady state, N identical servers, an approximation of response time is R = S / (1 - U^N), where S = service time and U = per-server utilization = throughput * S / N
  • Issues with this simplistic model: bursty traffic, service time varies, the N servers don't all process the same thing, and virtual hardware makes it a lot harder to figure out
  • Measurement errors (especially around measuring time)
  • So don't bother with utilization
  • Load average on Linux is broken: it also counts processes blocked on disk I/O
  • I/O wait is fundamentally broken: the CPU never waits for I/O per se
  • Cockcroft Headroom Plots: 99th percentile against response time
  • On Linux, the best way to track I/O per process is with SystemTap
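
To get a feel for the response time approximation in the notes above, here is a minimal Python sketch. It assumes the form R = S / (1 - U^N), with U taken as per-server utilization (throughput * S / N). The function names and the sample numbers are mine, not from the talk, and the caveats listed above (bursty traffic, varying service time, virtualization) all still apply.

```python
def per_server_utilization(throughput, service_time, servers=1):
    """Utilization of each server: U = throughput * S / N (assumed form)."""
    if servers < 1:
        raise ValueError("servers must be >= 1")
    return throughput * service_time / servers


def response_time(service_time, utilization, servers=1):
    """Approximate mean response time: R = S / (1 - U**N).

    Only meaningful in steady state, i.e. for 0 <= U < 1.
    """
    if servers < 1:
        raise ValueError("servers must be >= 1")
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization ** servers)


if __name__ == "__main__":
    service_time = 0.010  # hypothetical 10 ms service time
    for servers in (1, 4, 16):
        for u in (0.5, 0.8, 0.9, 0.95):
            r = response_time(service_time, u, servers)
            print(f"N={servers:2d}  U={u:.2f}  R={r * 1000:8.2f} ms")
```

Running it shows the knee of the curve: with one server, response time climbs steeply well before 100% utilization, while with many servers it stays near S for longer and then degrades abruptly. That is one reason a single utilization number says so little about headroom.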

On Joyent's recent storage mishaps

I have read with interest the comments on Joyent's blog from disgruntled users, some of whom appear to be professional system administrators. Beyond the technical merits of the recovery, it is quite clear that selling storage services without a backup scheme that can make data available again within a matter of hours, or a day at best, is a dangerous proposition for an internet startup boasting of technical prowess. At best, it casts a negative light on their understanding of what a backup is. Ultimately, I store data remotely not because ZFS or Thumper are cool (they are), but because I hope to be able to retrieve it more elegantly and more simply than by going to the bank to fetch the tape from the vault.

Interestingly enough, the brand leader in distributed storage, Amazon, has a Service Level Agreement that concerns itself solely with the service being available, not with whether the data returned are free of corruption. I wonder whether the next step will be data insurance against data loss and format obsolescence. Something to ponder.

In my context neither of these options is appealing, since data are the bread and butter of my company; hence the painstaking off-site backup process to mitigate risks, and the oh-so-enjoyable-but-needed task of spelling out a comprehensive disaster recovery plan.