Cassandra training with Jon Ellis from Riptano

Riptano, a newly-formed venture now offers training and commercial support of Cassandra, a key-value store of Facebook lineage. Cassandra's initial claim to fame is being the data store behind facebook's inbox. The training session started with a relatively high-level presentation of Cassandra's data model before jumping quickly into some real code from Twissandra, a simplified twitter clone based on Django. From there we were introduced to super-columns and their limitations, i.e. their subcolumns are not indexed so one should not pack too much in a super-column. As the day progressed we started to get deeper into operations and internals where the rubber usually meets the road and Jon was obviously very well-acquainted with the subject matter. My suggestion would be to add more diagrams to the presentation materials to illustrate the numerous points made during the session. Overall, considering the relatively paucity of documentation on Cassandra Jon's in-depth session is a nice shortcut to spending time scouring mailing lists and reading the source code to get a solid grasp of the topic. In the context of DataDog we use Cassandra to persist reliably and with little latency all inbound signals. But I'll save details for later...

CMG'09: "How 'normal' is your IT Data?"

Dr. Mazda Marvasti My notes on this very informative talk (the best I've seen today). The goal of the study was to evaluate the hypotheses around normal distribution assumption built in the newer IT monitoring tools, that create dynamic thresholds of the various metrics they collect.
  • Analyzed 4 workloads: ad-serving on LAMP, bond processing, stock trades and some online application
  • Test for normal distribution: Kolmogorov-Smirnov as it makes no assumption on the data distributions
  • Used average shifted histograms for the test
  • Results: none of the basic metrics (OS, applications, business-oriented) are normally distributed, neither are their averages, when looking at blocks of 1 hour
  • For instance Monday 9am does not look at all like Tuesday 9am
  • Also Mondays 9am don't on average converge, meaning that their average are not independent and/or the averages are not identically distributed
  • Business cycles matter very much in analysis, spectral analysis can help!
  • Correlations examined using Spearman's ranked correlation coefficient (though results not presented).
  • Conclusion: go for non-parametric analysis, known distributions don't really apply
  • If you enable dymanic thresholds based on normal distribution assumptions, expect a 10x in the number of alerts -- though it's possible to mitigate this with use of topology rules (e.g. "don't alert me if event 1 and event 2 coöccur)
My take on this: IT data analysis is challenging. One question is: how much is it worth, i.e. at what scale do you get your money back (and more) by getting this type of fairly sophisticated analysis and what kind of return can you expect of it? While the answer depends on the nature of the business conducted, I'm curious to see whether it's bigger shops with expensive applications, cloud-scale companies or whether this is going to percolate toward the smaller web shops, integral to an Infrastructure-as-a-Service offering? Stay tuned...

Looking into system performance of an Oracle data warehouse

Introduction

This is the start of an ongoing investigation into system performance of an oracle 10.2 data warehouse being loaded . The database server has 2 real storage volumes (called dw-clear and dw-encrypt) and 1 virtual one (dw-encrypt-u) used to decrypt data on the fly. Most of the data and the i/o are on the dw-clear volume. System-data performance have been collected via sadc -d to capture per-device statistics. The data are then extracted using sadf -d filename -- -d -b -d. The summary is available here as a csv. It's a large table of block i/o stats, cpu stats and per-device i/o stats, suitable to be imported into R. The system characteristics are as follows.
  • Sun x4150 64GB RAM, 2x4 x5450, 1 4Gb/s QL2462 HBA with 2 ports.
  • 3 device-mapper devices, 2 using a round-robin multipath (v1, v2), 1 using an on-the-fly cipher to decode encrypted data (v3).
  • 3PAR S400 with 10k drives and 4Gb/s HBAs.
  • Out of the 64GB, 8GB are set aside as HugePages to serve as memory pages for the SGA.
The goal of this investigation is to understand what the bottleneck is in the processing and what can be done to remove it. Let's start with cpu utilization. [caption id="attachment_172" align="aligncenter" width="510" caption="Distribution of CPU time spent in userland when not idle"]
Media_httpscaleordief_vwdgl
[/caption] Not terribly loaded (I'm filtering out the long idle portions with user > 5. How about I/O? [caption id="attachment_177" align="aligncenter" width="510" caption="% of CPU spent waiting on IO"]
Media_httpscaleordief_fwbab
[/caption] Interesting, iowait is not negligible. Is it correlated to anything in particular? First of all, let's see how iowait varies with device utilization of v1.
Media_httpscaleordief_ihhfi
v1 is slowly but surely bringing iowait higher, to the point than more than one processor ends up waiting on I/O. To be continued...

Blog battle on the storage appliance front

Backblaze has started an interesting conversation by detailing how they get to $117,000 per PB, down to the type and number of SATA card used in their design. A great PR move for a company in the crowded personal backup space. Of course publishing comparisons with Dell, Sun, NetApp and EMC at 8x, 10x, 30x the price is a sure way to start stirring people's emotions. The first to publish a lengthy response (that StorageMojo could find) is Joerg Moellenkamp in a blog post. Laudable in pointing design flaws for fundamentally 2 different markets. Sure, Sun's hardware is a great piece of engineering, squarely aimed at the enterprise market. Which, incidentally, is not buying in droves and Sun's financials is clearly reflecting that. Backblaze took the google route for storage and it's hard to see, given the competitive pressure, how they would be better off spending their margin on Sun hardware. The era of gold-plated hardware is slowly drawing to a close and I can't say I oppose that change.

A sensible approach to source code branching

Source code branching is one of the most contentious activity that you can engage in a software company. For some reason that's eluding me, I keep hearing the same arguments over and over again about why we should not use branches, about how branching is hard. It's not, neither conceptually, nor practically, it simply requires to be methodical and to overcome a visceral fear of the *Merge*. It works more or less with all current tools, with CVS probably the hardest to deal with and the last batch of distributed source control, the easiest. One of the primary problems that Feature Crews address is the difficulty of maintaining the integrity of very large code bases under development (imagine 1000 developers coding against a 10,000,000 line system). FC poses the problem as the tension between a) keeping the main branch as current as possible, and b) keeping the main branch as robust as possible. The FC solution is to make features an atomic transaction. A feature is either 0% complete or 100% complete, and a feature is not 100% complete until it can be demonstrated that it satisfies the same quality criteria as the rest of the main branch. Here's an excerpt from Lean Software Engineering. FC in this context means "feature crew". "Features-in-process are not allowed on the main branch. The FC alternative is branch-by-feature. A crew takes a branch when it takes possession of the feature kanban. The crew is responsible for forward-integrating any changes that are checked into main while their feature is in process. That is, if another crew integrates and breaks your feature-in-process, it’s your responsibility, not theirs. When your feature is finally complete AND you have integrated with all changes on main AND you pass all of the quality gates, THEN you can reverse integrate your feature into the main branch, and everybody else will have to forward integrate your changes." Here it is: use branches extensively, merge back and forth. It takes some time, a bit of practice, but it puts to rest these endless discussions about whether we should branch, when and what for.

Attempts at using a SunFire x4500 "Thumper" as an iSCSI SAN

For development purposes I was looking for a "cheap" upgrade over our aging Apple XServeRAID fiber channel arrays. These served us well, if one excludes the lack of LUN masking in the later firmware versions, but we have consistently outgrown their native capacity. Besides Apple (and its dismal enterprise support) has stopped selling them so we have been left with no choice but to look elsewhere. The basic requirements are:
  1. Fiber Channel or iSCSI target support to support database workloads
  2. Ability to carve LUNs out of a pool that is large enough (at least 20 TB of raw storage)
  3. Ability to clone volumes
  4. Ability to take snapshots within seconds
There are a few candidates that fullfil this bill: Equallogic, Sun Thumper, 3PAR, Compellent, the XServe replacement from Promise coupled with LVM, various Overland devices. The general hotness of ZFS and the "Try-n-buy" program from Sun made it acceptable to give their hardware and software a try. After all Solaris is not that different from linux (or should I say commute both terms), the hardware is dirt cheap (you can't get much cheaper) and Sun's expertise with hardware systems based on Opterons has been proven in-house on their wonderful SunFire x4600. To make a long story short, the iSCSI target daemon on Solaris 10 is not stable enough for production use. We were plagued with numerous core dumps, causing the iSCSI setup to flickr and initiators to moderately appreciate the frequent interrupts. Setting up OpenSolaris seemed to help a bit but our trust in the stability of the code had suffered an irremediable blow (well, not quite irremediable, but at least for another year or so). Pressed for time I have decided to cut our losses short and to spend more money to get a 48 TB 3PAR E200, vastly more expensive per TB but also known to work. I really wished the Thumper trial had been successful; I believe in OpenSolaris, what I've seen of ZFS makes me green with envy (compared to ext3 + LVM) but I simply cannot justify downtime and/or the hiring of a Solaris core code guru to troubleshoot this mess.

Velocity: John Allpaw @flickr, Capacity Planning

What can cause downtime:
  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems
Deployment and management tricks from the HPC world: ganglia, System Imager Gather metrics of course, and build models, ideally out of live data, rather than artificial benchmarks. fityk can be used to replace excel to do curve fitting. [My guess is that R would work great for that too] Some flickr stats: 12,629 nagios checks, 1314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day. [So flickr uses nagios + ganglia] One key trick is to build kill switches in all the features so as to turn things off when load increases.

Velocity: Adam Bechtel @yahoo, Performance plumbing

When building a global network, you start building out knobs (usually implemented as routing policies): cost, packet loss, latency, maintenance, diversity, isolation, "special" [Really funny analogy between anycast and toilets, caching and water supply] After having developed routing policies, you start looking into anycast. One of the first services to be anycast is DNS. Anycast scaling: vip, ecmp Anycast considerations: how to monitor services? how to control users? how to handle transient network events?

Velocity: Rich Wolski @ucsb, EUCALYPTUS

Eucalyptus is an open-source implementation (not production-ready) of a compute cloud API-compatible with EC2. In academia sysadmin time is very expensive so the roll-out has to be really simple. Eucalyptus currently uses xen and includes a security layer that replaces Amazon's use of the credit card authentication/authorization system. Mention of ROCKS, a cluster deployment system.