How about sub-second queries in Hadoop?

Two observations from talking and listening to people during the Hadoop summit; firstly hadoop is used quite often to process clickstream data -- in all fairness I missed the talk about hadoop used for genomics. Secondly and a corollary of the first, sub-second queries in hive or pig are not quite there yet. Since a hive query translate into maps and reductions their scheduling determines in addition to the sheer volume of data is going to determine response time. Undoubtedly pre-computing aggregates is a natural way to go much like what is done for data warehouses. Where these aggregated should be stored for consumption is a problem that could to hybrid solutions. Process data with hadoop and export then to postgres or infobright to enjoy a more mature (but less scalable) run-time environment. Get multi-terabyte daily processing and sub-second analytics and all that open source. If you've done something like that, I'd be interested to know before I embark on a route where others have failed before.

Notes from the 2009 Hadoop Summit West

I just got back from Santa Clara where Yahoo and Cloudera were hosting the 2009 Hadoop Summit West on Wednesday followed by a training on Thursday. My interest was one of a prospective user -- to gauge how real and mature hadoop is. The turn-out was more than decent, in the hundreds; a number from Yahoo, running the largest clusters so far, a few folks from Amazon, Facebook, some local universities and a fair number of small companies that have deployed their own clusters (or are running on EC2). The good news first, hadoop is real and it's getting real use. It's clearly a promising platform with active use and development. The scaling model is fairly simple: buy more machines. The current sweet spot is dual-quad hosts with 4x1TB drives and 16GB or so of ECC RAM. Decoupling storage from a central system (à la SAN) is the way to go. Some folks have tried to hook up Thumpers to Niagara chips that run a lot of threads in parallel with some success but the TCO question is unclear. Hence we can start with a handful of cheap machines and go from there. A few things to watch for: the secondary name node for instance, is there here for backup but to persist the DFS layout structures that exist in RAM on the primary name node. It could have been implemented in a more robust fashion using a sql database rather than requiring a re-implementation of redo logs and data files. That's overall the negative point: applications built on the platform (such as hive, hbase and pig) are still pretty much works in progress, somewhat duplication functionality. There is an air of Not Invented Here that still pervades but it's a sign that the whole thing is still young. A vocal user base that meets regularly should help the project focus on the pieces that truly do not exist yet.

Kaizen this, kaizen that...

From continuations:
Another basic tenet of Kaizen is that inventory is evil.  Inventory is what folks use to cover up the problems and inefficiencies in their production process. Kaizen considers a significant additional cost that is much harder to quantify — the cost of allowing problems and inefficiencies in production to persist. Kaizen thinks of inventories like a lake.  When you drain the swamp you will find the “bodies” that lie buried below the surface of the water.  I believe in software development the analogy to inventory is excessive hardware. [...]
At first glance I liked that analogy, it provides a quasi-artistic view of building and running computer systems: you should strive for simplicity almost for simplicity's sake; you're done when you can't remove anything without impairing the primary function of the system. It's zen gardening or the art of computer system production; more (software) with less (hardware). Oscar Wilde would have said "Optimization for optimization's sake". On second thought I don't like the analogy that much, especially the bit about hardware. A) it assumes some form of steady state in the production, a assumption which industrial manufacturing very much verifies. You take a design that works and change it a bit so long as it works (better). Christopher Alexander in his notes on the synthesis of form explains this much better than I do. Except that more often than not, you throw more hardware at a problem because you're not in steady state (i.e. explosive growth) and because it's inherently cheaper than re-architecting a platform. Being able to throw more hardware at a problem and seeing worthy improvements is also a sign of good design, it means that you can ride the commodity wave. B) I'm never too excited about distilling a method of action based on a "cost that is much harder to quantify". If it is solely qualitative it can easily become subjective, and from subjective to ideological there is not much of a gap. And the next thing you know, you're forgoing easy parallelism because, somehow, Kaizen comes to signify that throwing more hardware at a problem is a bad sign. Don't get me wrong. I don't condone wasting cycles that could be save. Throwing more hardware at a problem is bad for obvious environmental reasons and I welcome any move to optimize software so that it runs on low-power hardware, but I'm digressing...

Structure08, my impressions so far

It's a day packed with keynotes, panels and shmoozing, with some topics overlapping with Velocity; yet at a much higher level. We've alternated between interesting panels ("Harnessing explosive growth") where the key points are:
  1. a proper architecture lets you scale [much like in traditional building]
  2. build kill switches in all your features
  3. get operations and development on a symbiotic relationship [salesforce and amazon do it]
Some other panels are clearly more about pushing your product ("The race to the next database"). The topic of processing data (possibly in the cloud) is of course crucial yet very few concerns around switching costs, security and privacy are addressed. My take on this is that if you need to run analytics on your data sets and said data sets are huge, you need compute to be close (from a network distance perspective) to your data. Which means that your data must be in the cloud. While I'm reluctant to go down that route right now, Greg Papadopoulos @sun made the compelling analogy that money storage is delegated to reputable third-parties (called banks) so data are likely to follow the same treatment, i.e. the cloud is likely to become the most secure place to store data (or most resilient with an acceptable security policy). Sun's interesting take on cloud computing is Project Caroline, where infrastructure, including network bits, is driven by code, in a way, that's presumable a bit cleaner than EC2 (which is quite bare). Dr. Vogel's presentation @amzn, was inspiring despite containing basically little new information but fits well into this type of conference, which act as reinforcement devices to jumpstart a new industry. Live coverage is at gigaom.

Velocity: John Allpaw @flickr, Capacity Planning

What can cause downtime:
  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems
Deployment and management tricks from the HPC world: ganglia, System Imager Gather metrics of course, and build models, ideally out of live data, rather than artificial benchmarks. fityk can be used to replace excel to do curve fitting. [My guess is that R would work great for that too] Some flickr stats: 12,629 nagios checks, 1314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day. [So flickr uses nagios + ganglia] One key trick is to build kill switches in all the features so as to turn things off when load increases.

Velocity: Jiffy, open-source performance measurement

Scott Ruthfield, Whitepages.com, a people search company with 2 bn searches per year, 500 requests/s at peak. Initial analysis: 8s to return results, sub-second to actually get the data. What's the source of the slow-down? Possible candidates:
  1. Ads
  2. Microsoft Virtual Earth
  3. Content generated from the results
Toolset (Gomez networks) is not good enough because of poor sampling (20 samples per hours, compared to 1.3 MM requests) [should quantify error margin here, presumably high assuming a normal distribution] Introducting Jiffy. Objectives: measure anything, with little impact on page performance. Architecture starts with a jiffy.js that generates logs, then loaded into a DB and rolled up. Basic tenet: mark and measure. One mark, multiple measures. Miscellaneous features: immediate or batch submits (to not overload measurement system), default browser event measurements (onload, etc.) Bill Scott @netflix put together a firebug plug-in to capture client-side data. Source: code.whitepages.com

Velocity: Green Data Centers by Bill Coleman

Notes taken from the floor Problem with current data centers: rising energy costs, increased complexity. Current solution: automate further, just pushing back. What's started to happen is an "inflexion point" [I'm not sure I see why this mathemaical term has been chosen], the ability of anyone on the world to be connected to anything in the world, we're getting there. The current cloud: 1.0, IT-centric, used to build proprietary applications. 2.0, store everything on the cloud, with security but still proprietary. Commoditization is unstoppable and is happening in the next decade. How do I get started with green data centers?  Firstly you can shut down servers as soon as you've figured out which ones to turn off. The problem is to find out dependencies and shut down the right servers. Why? Save money but the overarching goal is to drive automation by policy [presumably requiring an ontology to let systems know about themselves]. Average utilization percent for VMs is for 50,000 virtual machines is barely 20%, which compared with mainframe utilization figures is quite low (80%).

Erlang is at long last getting the break it deserves.

Facebook chat is a heavy erlang user (so is SimpleDB). Erlang is one of these languages that open your eyes to a new way of programming. Eight years ago, shortly after it was open-sourced, I used it to build a reliable message passing system for a small start-up (that never quite made it). I remember being in awe of 3 features:
  1. The explicit inclusion of time in the language, this is probably the killer feature. You can write elegant program that expect events to happen within a certain timeframe and react if no events show up. Because it is an integral part of the language you have to think about failure and how to handle it.
  2. Hot code upgrades; it was hot in 1999, it's still hot. Build systems with the aim of zero-downtime, even for releases. With share-nothing architecture this might be less relevant now, but there is often a little shared state that creeps in and requires high-availability of a core component.
  3. Service dependency; a service is built out of components that must be functional for the service to be rendered properly. One often ends up slapping an external monitoring layer on top of the whole thing and kludgy scripts to restart components the best one can based on the data available to the external monitoring layer. With erlang, it's all in the box, no tools required.
Nice to see a great piece of engineering (you have to read to the book "Programming in Erlang") getting the exposure it deserves.