CMG'09: "How 'normal' is your IT Data?"

Speaker: Dr. Mazda Marvasti. These are my notes on this very informative talk (the best I've seen today). The goal of the study was to test the assumption of normally distributed data that is built into the newer IT monitoring tools, which use it to create dynamic thresholds for the various metrics they collect.
  • Analyzed 4 workloads: ad-serving on LAMP, bond processing, stock trades and some online application
  • Test for normal distribution: Kolmogorov-Smirnov, as it makes no assumption about the distribution of the data
  • Used average shifted histograms for the test
  • Results: none of the basic metrics (OS, applications, business-oriented) are normally distributed, neither are their averages, when looking at blocks of 1 hour
  • For instance Monday 9am does not look at all like Tuesday 9am
  • Also, the Monday 9am blocks do not converge on average, meaning that their averages are not independent and/or not identically distributed
  • Business cycles matter very much in analysis, spectral analysis can help!
  • Correlations examined using Spearman's ranked correlation coefficient (though results not presented).
  • Conclusion: go for non-parametric analysis, known distributions don't really apply
  • If you enable dynamic thresholds based on normal distribution assumptions, expect a 10x increase in the number of alerts, though it's possible to mitigate this with topology rules (e.g. "don't alert me if event 1 and event 2 coöccur")
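As a rough illustration of the points above (not the speaker's code or data), here is a minimal Python sketch using only the standard library. It computes the Kolmogorov-Smirnov statistic of a sample against a normal distribution fitted to that sample, then compares a threshold derived from the normal assumption with a purely empirical percentile. The lognormal "metric", the 99.9% target and the function names are all assumptions made for the example. Because the normal parameters are estimated from the data, the classical critical value 1.36/sqrt(n) is conservative here, so a rejection against it is still meaningful.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_statistic_vs_normal(samples):
    """K-S statistic D between the sample and a normal fitted to it."""
    xs = sorted(samples)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x: check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def empirical_quantile(samples, q):
    """Nearest-rank quantile: no assumption about the distribution."""
    xs = sorted(samples)
    rank = max(1, math.ceil(q * len(xs)))
    return xs[rank - 1]


rng = random.Random(42)
# Hypothetical skewed metric (think response times); synthetic, not real data.
metric = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]

d = ks_statistic_vs_normal(metric)
critical = 1.36 / math.sqrt(len(metric))  # asymptotic 5% K-S critical value
looks_normal = d <= critical

# Both thresholds are meant to flag roughly the top 0.1% of observations.
fitted = NormalDist(mean(metric), stdev(metric))
normal_threshold = fitted.inv_cdf(0.999)             # assumes normality
empirical_threshold = empirical_quantile(metric, 0.999)  # non-parametric

normal_alerts = sum(x > normal_threshold for x in metric)
empirical_alerts = sum(x > empirical_threshold for x in metric)

print(f"D={d:.3f} critical={critical:.3f} looks_normal={looks_normal}")
print(f"alerts with normal assumption: {normal_alerts}")
print(f"alerts with empirical percentile: {empirical_alerts}")
```

On skewed data like this, the threshold built on the normal assumption sits too low and fires far more often than intended, which is the kind of alert inflation the talk warned about.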
My take on this: IT data analysis is challenging. One question is how much it is worth, i.e. at what scale do you get your money back (and more) from this type of fairly sophisticated analysis, and what kind of return can you expect from it? While the answer depends on the nature of the business conducted, I'm curious to see whether it stays with the bigger shops with expensive applications and the cloud-scale companies, or whether it is going to percolate toward the smaller web shops as an integral part of an Infrastructure-as-a-Service offering. Stay tuned...

Thinking about IT Operations and Kanban

As our developers are transitioning to an agile methodology, we have been figuring out how to adapt our operational processes to the more regular schedule that fixed-length, 1-month-long sprints are going to entail. So far we have worked in more of a waterfall fashion with a high level of interrupts: taking on projects, doing an upfront analysis to break the work into small chunks, and piping those chunks through FogBugz to track progress. Recently the team has been reading about kanban as a way to formalize flow and make under-capacity visible. While I believe we have already adopted an informal pull-driven process, now is the time to formalize all this so that we can properly communicate whether and when infrastructure projects can be delivered.

The first round of experiments is taking shape, more to follow shortly.

twiki is great... twiki is not so great.

To organize our internal IT information we have been using twiki. It is a very flexible tool by virtue of being a wiki and has two critical features out of the box that other wikis seem to lack:
  1. Forms
  2. and a fairly interesting search directive (%SEARCH%)
If you are not using these with twiki you are missing out; the analogy is using Word without styles. You can do without, but life is so much easier with them. Forms bring structure to wiki pages and let you treat each page as a structured record (the form) with a big, free-form description field (the page). For instance, our twiki implementation records hosts, hardware items, services, change requests and incident tickets, all with custom forms, so as to produce a pseudo-relational database on which we build reports. Examples of reports:
  1. list of all change requests awaiting peer review before approval
  2. list of all hosts assigned to a given project
  3. list of all hosts running on a given piece of hardware
The list goes on. Then we start having questions such as "which are the hardware pieces whose leases end in the next 3 months?" or "how many hosts run RedHat 4.5?". And that is when twiki breaks... Its reliance on a file-based scheme (and rcs) to maintain relationships imposes some unwelcome limitations, not to mention a level of performance that is difficult to accept on a daily basis (I know that caching is in the works, but it is just not built to scale).
Case in point: we define hosts (think linux hosts) as compute resources that execute on some physical substrate (think IBM x3550), so it is only natural that the host form has a field for the hardware item it executes on. In other words, there is a one-to-n relationship between hardware item and host. On the hardware item form we do not list the hosts that live on that hardware item, because the chances of dangling pointers are too great. We used to have that list, and we quickly ended up with hosts pointing to a piece of hardware that did not point back to them. In other words, we have had to limit the type of reports we can run because the underlying data implementation of twiki is lacking.
Questions such as "Which hardware items are home to more than 3 hosts?" become unnecessarily complicated, whereas with the proper framework it becomes as simple as:

    select hi.name, count(*)
    from hardware_item hi
    join hardware_host ho on (hi.sid = ho.hardware_sid)
    group by hi.name
    having count(*) > 3;

How about the list of potential single points of failure for a given service:

    select max(h.name) as hostname, hc.name as host_class
    from service s
    join service_host so on (s.sid = so.service_sid)
    join host h on (so.host_sid = h.sid)
    join host_class hc on (h.class_sid = hc.sid)
    where s.name = 'My critical service'
    group by hc.name
    having count(*) < 2;

Now, assuming I have such a relational database that twiki can query via a sql module, how different is it from the database that my monitoring package is based upon?
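To try the two queries above without setting up a database server, here is a small sketch that runs them against an in-memory SQLite database through Python's standard `sqlite3` module. The table and column names follow the queries; everything else (the `host_sid` column, the keys, and the sample hosts `web1`, `db1`, etc.) is made up for illustration and is not our actual schema. Making `host_sid` the primary key of `hardware_host` is one way to enforce the one-to-n relationship, so the list of hosts per hardware item is always derived and can never dangle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hardware_item (sid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE host_class    (sid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE host (
        sid       INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        class_sid INTEGER NOT NULL REFERENCES host_class(sid)
    );
    -- host_sid is the primary key: a host sits on at most one hardware item
    CREATE TABLE hardware_host (
        host_sid     INTEGER PRIMARY KEY REFERENCES host(sid),
        hardware_sid INTEGER NOT NULL REFERENCES hardware_item(sid)
    );
    CREATE TABLE service (sid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE service_host (
        service_sid INTEGER NOT NULL REFERENCES service(sid),
        host_sid    INTEGER NOT NULL REFERENCES host(sid),
        PRIMARY KEY (service_sid, host_sid)
    );

    -- Hypothetical sample data
    INSERT INTO hardware_item VALUES (1, 'hw-a'), (2, 'hw-b');
    INSERT INTO host_class VALUES (1, 'web'), (2, 'db');
    INSERT INTO host VALUES
        (1, 'web1', 1), (2, 'web2', 1), (3, 'web3', 1),
        (4, 'db1', 2), (5, 'db2', 2);
    INSERT INTO hardware_host VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 2);
    INSERT INTO service VALUES (1, 'My critical service');
    INSERT INTO service_host VALUES (1, 1), (1, 2), (1, 4);
""")

# Which hardware items are home to more than 3 hosts?
crowded = conn.execute("""
    SELECT hi.name, COUNT(*)
    FROM hardware_item hi
    JOIN hardware_host ho ON hi.sid = ho.hardware_sid
    GROUP BY hi.name
    HAVING COUNT(*) > 3
""").fetchall()

# Potential single points of failure: host classes with only one host
# behind the given service.
spof = conn.execute("""
    SELECT MAX(h.name) AS hostname, hc.name AS host_class
    FROM service s
    JOIN service_host so ON s.sid = so.service_sid
    JOIN host h ON so.host_sid = h.sid
    JOIN host_class hc ON h.class_sid = hc.sid
    WHERE s.name = ?
    GROUP BY hc.name
    HAVING COUNT(*) < 2
""", ("My critical service",)).fetchall()

print(crowded)
print(spof)
```

With this sample data, `hw-a` carries four hosts and the service relies on a single `db` host, so each query returns one row.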
In the ideal world my data model presents something that:
  • monitoring can use (service dependencies, host maps, etc.)
  • configuration management can use (change requests bound to given hosts, software items, etc.)
  • asset management can use
  • finance can use
The key properties that I would want such a system to keep are the ease with which it can be manipulated (nothing more cumbersome than twiki) and its accuracy (no duplicate data). At the same time, I have not found any product out there that fits the bill (a monitoring package with a solid data model that can be extended for other uses). So I might just bite the bullet and build a prototypical ERP for IT. Stay tuned.