CMG'09: "How do you analyze 100,000s of servers?"
Charles Loboz (Microsoft)
- No homogeneous software/hardware/applications
- Access is often limited (e.g. Hotmail servers are off-limits)
- In the old days, one server was analyzed per day
- Stopped using averages and stddev (because the data are not normally distributed)
- Built 10-bin histograms for utilization
- Even that is limited, because the long tails are what trigger issues (e.g. bad queries trigger load, then all queries pile up)
- No one cares about utilization (except data geeks), only performance matters
- Estimate the impact of utilization on performance with a "Performance Impact Factor" (PIF): a weighted average over the histogram bins, with heavier weights on the high-utilization bins so that long tails stand out; computed separately for CPU, network, and IO
- Compute histograms
- Compute PIFs for each server
- Cross-tabulate PIFs to server names to tag servers as underused, overloaded, etc.
- Store everything in a database
- PIF averages don't mean anything
- The method is good at spotting a "dead-cold" server, but it does not tell you that you have an issue, only that you have to investigate
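
A minimal sketch of the histogram-to-PIF steps in the notes above, assuming utilization samples are percentages in [0, 100]. The notes do not give the actual weights or cut-offs used in the talk, so the quadratic weights, the thresholds, and all function names below are hypothetical choices for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

NUM_BINS = 10  # 10-bin histogram, as in the notes

# Hypothetical weights: grow quadratically with the bin index so that
# time spent at high utilization dominates the result.
WEIGHTS = [((i + 1) / NUM_BINS) ** 2 for i in range(NUM_BINS)]


def utilization_histogram(samples: Sequence[float]) -> List[float]:
    """Return the fraction of samples in each equal-width bin over [0, 100]."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    counts = [0] * NUM_BINS
    for s in samples:
        clamped = min(max(s, 0.0), 100.0)
        # 100% falls into the last bin rather than a non-existent 11th one.
        idx = min(int(clamped * NUM_BINS / 100.0), NUM_BINS - 1)
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def pif(hist: Sequence[float], weights: Sequence[float] = WEIGHTS) -> float:
    """Weighted average of the histogram fractions (assumed PIF form)."""
    return sum(h * w for h, w in zip(hist, weights))


def tag(value: float, cold: float = 0.05, hot: float = 0.5) -> str:
    """Label a PIF value; both thresholds are assumptions, not from the talk."""
    if value <= cold:
        return "underused"
    if value >= hot:
        return "overloaded"
    return "normal"


def server_report(
    samples_by_resource: Dict[str, Sequence[float]]
) -> Dict[str, Tuple[float, str]]:
    """Compute one PIF and one tag per resource (e.g. cpu, net, io)."""
    report = {}
    for resource, samples in samples_by_resource.items():
        value = pif(utilization_histogram(samples))
        report[resource] = (value, tag(value))
    return report


if __name__ == "__main__":
    # Mostly idle CPU with a short burst near saturation: the mean is 14%,
    # but the weighted score is pulled up by the tail.
    cpu = [5.0] * 90 + [95.0] * 10
    print(server_report({"cpu": cpu, "net": [5.0] * 100, "io": [95.0] * 100}))
```

With these assumed weights, a server that sits at 5% utilization for 90% of the time and at 95% for the rest scores well above a uniformly idle one, which is the long-tail emphasis the notes describe. As the notes say, a tag is a prompt to investigate, not a diagnosis.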