CMG'09: "How do you analyze 100,000s of servers?"

Charles Loboz (microsoft)
  • No homogeneous software/hardware/applications
  • Access is often limited (e.g. hotmail servers are off-limit)
  • In the old days, 1 server analyzed per day
  • Stopped using averages and stddev (because data are not normal)
  • Built 10-bin histograms for utilization
  • Even that is limited, because long tails are the ones triggering issues (e.g. bad queries triggering load, then all queries will pile up)
  • No one cares about utilization (except data geeks), only performance matters
  • Estimate utilization impact on performance with "Performance Impact Factor" (PIF): a weighted average of histograms, heavy utilization should be favored to make long tails more obvious, for CPU, for net, for IO
Recipe
  • Compute histograms
  • Compute PIFs for each server
  • Cross-tabulate PIFs to server names to tag servers as underused, overloaded, etc.
  • Store everything in a database
Pitfalls
  • PIF averages don't mean anything
  • It's good to tell a "dead-cold" server, but it's not good to tell you that you have an issue, just that you have to investigate

How about sub-second queries in Hadoop?

Two observations from talking and listening to people during the Hadoop summit; firstly hadoop is used quite often to process clickstream data -- in all fairness I missed the talk about hadoop used for genomics. Secondly and a corollary of the first, sub-second queries in hive or pig are not quite there yet. Since a hive query translate into maps and reductions their scheduling determines in addition to the sheer volume of data is going to determine response time. Undoubtedly pre-computing aggregates is a natural way to go much like what is done for data warehouses. Where these aggregated should be stored for consumption is a problem that could to hybrid solutions. Process data with hadoop and export then to postgres or infobright to enjoy a more mature (but less scalable) run-time environment. Get multi-terabyte daily processing and sub-second analytics and all that open source. If you've done something like that, I'd be interested to know before I embark on a route where others have failed before.

Open cloud manifesto, not much radicalism here

The manifesto triggered copious traffic thanks to the backroom-smoke-filled air of its inception. I wish the same could be said about its less-than-radical contents. If you were expecting a stated vision about the cloud as the substrate of all future computing that's not a mobile phone or nettop, no such thing there. It sounds more like the cries of small players about to be crushed by the non-signatory parts, i.e. Amazon, Google and Microsoft.

Video on Hadoop from Yahoo!

http://us.dl1.yimg.com/download.yahoo.com/dl/ydn/hadoop.m4v Key points:
  1. cpu, ram size, i/o bandwidth increase exponentially; hard drive seek times do not.
  2. Relational databases and their b-tree datastructures require ln(n) seeks as a crude simplification.
  3. Sort/Merge algorithms working on flat files operate as function of the transfer rates or bandwidth, not seeks (ln(n) is mentioned but I'd think it's at least n.ln(n).
  4. Flat files allow data to not conform to a preconceived schema and is good for exploration
  5. Commodity PCs offer the best computation bang for your bucks but failures are bound to occur frequently; that's the google model of scaling out
Overall this reinforces the idea that MapReduce and its ilk are best-suited for loosely-structured data, or rather data which one does not know how to query and structure beforehand.