CMG'09: Solaris/Linux Performance Measurement and Tuning (part 2)

Adrian Cockcroft (Netflix). My notes:
  • Netflix releases every 2 weeks, first in beta, and tracks everything
  • Everything at Netflix (or in web-land in general) is instrumented, in libraries, so that instrumentation comes for free
  • Beware of kernel tweaks: they were good for older kernels, but kernels are now a lot more auto-tuned
  • On Solaris, microstate data very useful
  • With Poisson arrivals, steady state and N identical servers, response time can be approximated as R = S / (1 - utilization^N), where S = service time and utilization = throughput * S
  • Issues with this simplistic model: bursty traffic, service time varies, the N servers don't process the same thing, and virtual hardware makes it a lot harder to figure out
  • Measurement errors (especially around measuring time)
  • So don't bother with utilization
  • Load average on Linux is broken, it includes disk activity
  • I/O wait is fundamentally broken, the CPU never waits for I/O per se
  • Cockcroft Headroom Plots: 99th-%ile against response time
  • On Linux, the best way to track I/O per process is with SystemTap
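The response time approximation in the notes above can be tried out with a few lines of Python. This is a minimal sketch, assuming the commonly cited multi-server form R = S / (1 - U^N); the function name and the sample values are mine, not from the talk.

```python
def response_time(service_time, utilization, servers=1):
    """Approximate mean response time R = S / (1 - U^N).

    Assumes Poisson arrivals, steady state and N identical servers.
    `utilization` is a fraction in [0, 1).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a steady state")
    return service_time / (1 - utilization ** servers)


if __name__ == "__main__":
    service_time = 0.010  # 10 ms, an illustrative value
    for utilization in (0.5, 0.8, 0.9, 0.95):
        one = response_time(service_time, utilization, servers=1)
        four = response_time(service_time, utilization, servers=4)
        print(f"U={utilization:.2f}  1 server: {one * 1000:7.1f} ms"
              f"  4 servers: {four * 1000:7.1f} ms")
```

The loop shows how response time climbs as utilization approaches 1, and how more servers push the knee of the curve to the right at the same utilization.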

CMG'09: "How 'normal' is your IT Data?"

Dr. Mazda Marvasti. My notes on this very informative talk (the best I've seen today). The goal of the study was to evaluate the normal distribution assumption built into the newer IT monitoring tools, which create dynamic thresholds for the various metrics they collect.
  • Analyzed 4 workloads: ad-serving on LAMP, bond processing, stock trades and some online application
  • Test for normal distribution: Kolmogorov-Smirnov as it makes no assumption on the data distributions
  • Used average shifted histograms for the test
  • Results: none of the basic metrics (OS, applications, business-oriented) are normally distributed, neither are their averages, when looking at blocks of 1 hour
  • For instance, Monday 9am does not look at all like Tuesday 9am
  • Also, Mondays 9am don't converge on average, meaning that their averages are not independent and/or not identically distributed
  • Business cycles matter very much in analysis, spectral analysis can help!
  • Correlations examined using Spearman's ranked correlation coefficient (though results not presented).
  • Conclusion: go for non-parametric analysis, known distributions don't really apply
  • If you enable dynamic thresholds based on normal distribution assumptions, expect a 10x increase in the number of alerts -- though it's possible to mitigate this with the use of topology rules (e.g. "don't alert me if event 1 and event 2 coöccur")
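To get a feel for the normality test mentioned in the notes above, here is a minimal sketch of a one-sample Kolmogorov-Smirnov check in plain Python. It is not the speaker's code: the function names and the synthetic data are mine, and because the mean and standard deviation are fitted from the sample itself, the usual asymptotic 5% critical value of 1.36 / sqrt(n) is conservative (the Lilliefors variant corrects for that).

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(samples):
    """Largest gap D between the empirical CDF and a normal CDF fitted to the samples."""
    xs = sorted(samples)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d


def looks_normal(samples, coefficient=1.36):
    """Compare D with the asymptotic 5% critical value 1.36 / sqrt(n)."""
    return ks_statistic_normal(samples) < coefficient / math.sqrt(len(samples))


if __name__ == "__main__":
    n = 1000
    # Synthetic data: evenly spaced quantiles of a normal and of an exponential law.
    bell_shaped = [NormalDist(50, 10).inv_cdf((i + 0.5) / n) for i in range(n)]
    long_tailed = [-50 * math.log(1 - (i + 0.5) / n) for i in range(n)]
    for name, data in (("bell-shaped", bell_shaped), ("long-tailed", long_tailed)):
        print(f"{name}: D={ks_statistic_normal(data):.4f} normal? {looks_normal(data)}")
```

Running this kind of check per metric and per one-hour block is one way to see for yourself whether thresholds based on mean and standard deviation are defensible for your data.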
My take on this: IT data analysis is challenging. One question is: how much is it worth, i.e. at what scale do you get your money back (and more) by getting this type of fairly sophisticated analysis and what kind of return can you expect of it? While the answer depends on the nature of the business conducted, I'm curious to see whether it's bigger shops with expensive applications, cloud-scale companies or whether this is going to percolate toward the smaller web shops, integral to an Infrastructure-as-a-Service offering? Stay tuned...

CMG'09: "How do you analyze 100,000s of servers?"

Charles Loboz (Microsoft)
  • No homogeneous software/hardware/applications
  • Access is often limited (e.g. Hotmail servers are off-limits)
  • In the old days, 1 server analyzed per day
  • Stopped using averages and stddev (because data are not normal)
  • Built 10-bin histograms for utilization
  • Even that is limited, because long tails are the ones triggering issues (e.g. a bad query triggers load, then all queries pile up)
  • No one cares about utilization (except data geeks), only performance matters
  • Estimate the impact of utilization on performance with a "Performance Impact Factor" (PIF): a weighted average of the histogram bins, with heavier weights on high utilization to make long tails more obvious. One PIF each for CPU, network and I/O
Recipe
  • Compute histograms
  • Compute PIFs for each server
  • Cross-tabulate PIFs to server names to tag servers as underused, overloaded, etc.
  • Store everything in a database
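The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the talk did not give the weights or the cut-offs, so the cubic weighting and the `cold` and `hot` thresholds below are assumptions, as are all function and server names.

```python
def utilization_histogram(samples, bins=10):
    """Fraction of samples in each of `bins` equal-width bins over 0-100% utilization."""
    counts = [0] * bins
    for u in samples:
        index = min(int(u / 100 * bins), bins - 1)  # 100% lands in the top bin
        counts[index] += 1
    return [c / len(samples) for c in counts]


def performance_impact_factor(histogram, exponent=3):
    """Weighted average of the bins; weights grow steeply with utilization (assumed)."""
    bins = len(histogram)
    weights = [((i + 1) / bins) ** exponent for i in range(bins)]
    return sum(w * f for w, f in zip(weights, histogram))


def tag_server(pif, cold=0.01, hot=0.3):
    """Turn a PIF into a coarse label; the cut-offs are illustrative only."""
    if pif < cold:
        return "underused"
    if pif > hot:
        return "overloaded"
    return "normal"


if __name__ == "__main__":
    # Hypothetical CPU utilization samples (percent) per server.
    cpu_samples = {
        "web-01": [5] * 100,
        "db-01": [25] * 90 + [95] * 10,
        "batch-01": [95] * 100,
    }
    for name, samples in cpu_samples.items():
        pif = performance_impact_factor(utilization_histogram(samples))
        print(f"{name}: PIF={pif:.4f} tag={tag_server(pif)}")
```

In practice the same computation would be repeated for network and I/O, and the per-server results stored in a database for cross-tabulation.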
Pitfalls
  • PIF averages don't mean anything
  • The PIF is good at spotting a "dead-cold" server, but it cannot tell you that you have an issue, just that you have to investigate

#Cloudcamp: storing sensitive data in a public cloud

Yesterday at CloudCamp, a few of us discussed methods to store and use sensitive data on a public cloud, where you presumably do not have strong assurances that your data are for your eyes only. For structured data (e.g. relational data), a pattern emerged among participants, assuming your data have an easy key:
  1. Store all non-identifiable data in the cloud, keyed by an arbitrary identifier.
  2. Keep actual identifiable data on-premises with the mapping to that arbitrary identifier.
  3. Let the client device resolve the mapping locally.
For instance, suppose you store transactions: this would require at a minimum scrambling transaction details such as item name and party name. That mitigates simple analysis of the data using statistical methods to derive information, which can be acceptable. Of course everyone remembers the AOL search term fiasco, where people could be identified based on their search terms. This is why the scheme should work best if the data are highly structured.
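The pattern can be sketched as follows. This is a minimal illustration under my own assumptions: the `LocalVault` class, the field names and the use of random UUIDs as arbitrary identifiers are hypothetical, and a real deployment would need durable, access-controlled storage for the on-premises mapping.

```python
import uuid

# Fields treated as identifiable in this example (an assumption).
SENSITIVE_FIELDS = ("party_name", "item_name")


class LocalVault:
    """On-premises mapping between identifiable values and arbitrary identifiers."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value):
        # Reuse the same identifier for a repeated value so records stay joinable.
        if value not in self._to_token:
            token = uuid.uuid4().hex
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def resolve(self, token):
        return self._to_value[token]


def to_cloud_record(record, vault):
    """Replace identifiable fields with arbitrary identifiers before upload."""
    return {k: vault.tokenize(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}


def from_cloud_record(record, vault):
    """Resolve identifiers locally, on the client side."""
    return {k: vault.resolve(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}


if __name__ == "__main__":
    vault = LocalVault()
    transaction = {"party_name": "Alice Example", "item_name": "Widget", "amount": 42}
    stored = to_cloud_record(transaction, vault)
    print("stored in the cloud:", stored)
    print("resolved locally:   ", from_cloud_record(stored, vault))
```

Because a repeated value always maps to the same identifier, frequency analysis on the cloud side remains possible, which is the residual risk noted above.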

Structure08, my impressions so far

It's a day packed with keynotes, panels and schmoozing, with some topics overlapping with Velocity, yet at a much higher level. We've alternated between interesting panels and product pitches. One interesting panel was "Harnessing explosive growth", where the key points were:
  1. a proper architecture lets you scale [much like in traditional building]
  2. build kill switches in all your features
  3. get operations and development into a symbiotic relationship [salesforce and amazon do it]
Some other panels are clearly more about pushing your product ("The race to the next database"). The topic of processing data (possibly in the cloud) is of course crucial, yet very few concerns around switching costs, security and privacy are addressed.
My take on this is that if you need to run analytics on your data sets and said data sets are huge, you need compute to be close (from a network distance perspective) to your data, which means that your data must be in the cloud. While I'm reluctant to go down that route right now, Greg Papadopoulos @sun made the compelling analogy that money storage is delegated to reputable third parties (called banks), so data are likely to receive the same treatment, i.e. the cloud is likely to become the most secure place to store data (or the most resilient with an acceptable security policy).
Sun's interesting take on cloud computing is Project Caroline, where infrastructure, including network bits, is driven by code, in a way that's presumably a bit cleaner than EC2 (which is quite bare). Dr. Vogels's presentation @amzn was inspiring despite containing little new information; it fits well into this type of conference, which acts as a reinforcement device to jumpstart a new industry. Live coverage is at gigaom.

Velocity: Green Data Centers by Bill Coleman

Notes taken from the floor.
Problem with current data centers: rising energy costs, increased complexity. Current solution: automate further, which just pushes the problem back. What's started to happen is an "inflexion point" [I'm not sure I see why this mathematical term has been chosen]: the ability of anyone in the world to be connected to anything in the world. We're getting there. The current cloud is 1.0: IT-centric, used to build proprietary applications. 2.0: store everything on the cloud, with security but still proprietary. Commoditization is unstoppable and is happening in the next decade.
How do I get started with green data centers? Firstly, you can shut down servers as soon as you've figured out which ones to turn off. The problem is to find out the dependencies and shut down the right servers. Why? To save money, but the overarching goal is to drive automation by policy [presumably requiring an ontology to let systems know about themselves]. Average utilization across 50,000 virtual machines is barely 20%, which is quite low compared with mainframe utilization figures (80%).