Two observations from talking and listening to people during the Hadoop Summit. Firstly, hadoop is used quite often to process clickstream data -- in all fairness, I missed the talk about hadoop used for genomics. Secondly, and as a corollary of the first, sub-second queries in hive or pig are not quite there yet. Since a hive query translates into map and reduce tasks, the scheduling of those tasks, in addition to the sheer volume of data, determines response time. Pre-computing aggregates is a natural way to go, much like what is done for data warehouses.
Where these aggregates should be stored for consumption is a problem that could lead to hybrid solutions: process data with hadoop, then export the results to postgres or infobright to enjoy a more mature (but less scalable) run-time environment. That would give multi-terabyte daily processing and sub-second analytics, all of it open source.
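To make the hybrid idea concrete, here is a minimal sketch of the flow, under loud assumptions: the click records, the `daily_page_views` table and all function names are hypothetical, the map and reduce steps run in-process rather than on a hadoop cluster, and SQLite stands in for postgres or infobright so the example stays self-contained.

```python
import sqlite3
from collections import Counter

# Hypothetical clickstream records: (day, page, user_id).
CLICKS = [
    ("2009-06-10", "/home", "u1"),
    ("2009-06-10", "/home", "u2"),
    ("2009-06-10", "/search", "u1"),
    ("2009-06-11", "/home", "u3"),
]


def map_click(record):
    """Map step: emit a ((day, page), 1) pair for one click."""
    day, page, _user = record
    return (day, page), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (day, page) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def export_aggregates(totals, conn):
    """Load the pre-computed aggregates into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_page_views ("
        "day TEXT, page TEXT, views INTEGER, PRIMARY KEY (day, page))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_page_views VALUES (?, ?, ?)",
        [(day, page, views) for (day, page), views in totals.items()],
    )
    conn.commit()


def views_for(conn, day, page):
    """Serve a lookup from the aggregate table instead of the raw clicks."""
    row = conn.execute(
        "SELECT views FROM daily_page_views WHERE day = ? AND page = ?",
        (day, page),
    ).fetchone()
    return row[0] if row else 0


# Batch side: aggregate the raw clicks (the part hadoop would do at scale).
totals = reduce_counts(map_click(record) for record in CLICKS)

# Serving side: an in-memory SQLite database stands in for postgres here.
conn = sqlite3.connect(":memory:")
export_aggregates(totals, conn)
```

The point of the split is that the expensive scan happens once per batch, while each interactive query touches only a small, indexed table.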
If you've done something like that, I'd be interested to know before I embark on a route where others have failed before.
I just got back from Santa Clara where Yahoo and Cloudera were hosting the 2009 Hadoop Summit West on Wednesday followed by a training on Thursday. My interest was one of a prospective user -- to gauge how real and mature hadoop is.
The turn-out was more than decent, in the hundreds; a number from Yahoo, running the largest clusters so far, a few folks from Amazon, Facebook, some local universities and a fair number of small companies that have deployed their own clusters (or are running on EC2).
The good news first: hadoop is real and it's getting real use. It's clearly a promising platform with active use and development. The scaling model is fairly simple: buy more machines. The current sweet spot is dual quad-core hosts with 4x1TB drives and 16GB or so of ECC RAM. Decoupling storage from a central system (à la SAN) is the way to go. Some folks have tried to hook up Thumpers to Niagara chips, which run a lot of threads in parallel, with some success, but the TCO question is unclear.
Hence we can start with a handful of cheap machines and go from there. A few things to watch for: the secondary name node, for instance, is not there for backup but to persist the DFS layout structures that exist in RAM on the primary name node. It could have been implemented in a more robust fashion using a SQL database rather than requiring a re-implementation of redo logs and data files.
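The "redo logs and data files" idea can be illustrated with a toy model. This is a sketch under stated assumptions, not hadoop's actual code or on-disk format: the namespace is a plain dictionary mapping paths to block ids, the edit log is a list of tuples, and `apply_edit` and `checkpoint` are hypothetical names.

```python
def apply_edit(namespace, edit):
    """Replay one logged operation against an in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        # Copy the block list so the image does not alias the log entry.
        namespace[path] = list(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError("unknown operation: %r" % (op,))


def checkpoint(image, edit_log):
    """Fold the edit log into a copy of the image.

    Returns the new image and an empty log, which is the essence of a
    checkpoint: later recovery replays only edits made after this point.
    """
    new_image = dict(image)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


# Hypothetical state: one file in the last image, two edits logged since.
image = {"/logs/day1": [1, 2]}
edit_log = [
    ("create", "/logs/day2", [3, 4, 5]),
    ("delete", "/logs/day1"),
]

new_image, remaining_log = checkpoint(image, edit_log)
```

A node doing this work is a helper for persistence, which is why it should not be mistaken for a standby that can take over.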
That's the overall negative point: applications built on the platform (such as hive, hbase and pig) are still pretty much works in progress, somewhat duplicating functionality. There is an air of Not Invented Here that still pervades, but it's a sign that the whole thing is still young. A vocal user base that meets regularly should help the project focus on the pieces that truly do not exist yet.
Notes taken from the floor
Problem with current data centers: rising energy costs, increased complexity. Current solution: automate further, which just pushes the problem back.
What's started to happen is an "inflexion point" [I'm not sure I see why this mathematical term has been chosen]: the ability of anyone in the world to be connected to anything in the world. We're getting there.
The current cloud: 1.0, IT-centric, used to build proprietary applications. 2.0, store everything on the cloud, with security but still proprietary. Commoditization is unstoppable and will happen over the next decade.
How do I get started with green data centers? Firstly, you can shut down servers as soon as you've figured out which ones to turn off. The problem is to find the dependencies and shut down the right servers. Why? To save money, but the overarching goal is to drive automation by policy [presumably requiring an ontology to let systems know about themselves]. Average utilization across 50,000 virtual machines is barely 20%, which is quite low compared with mainframe utilization figures (80%).
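As a back-of-the-envelope reading of those figures (my own arithmetic, not something presented on the floor): if the same total load could be packed at the mainframe-like 80% level, far fewer machines would be needed. The sketch below assumes identical machines and a perfectly divisible load and ignores peaks, so it is an optimistic bound; `machines_needed` is a hypothetical helper.

```python
def machines_needed(machine_count, current_pct, target_pct):
    """Machines required to carry the same load at a higher utilization.

    Percentages are integers so the arithmetic stays exact; the result
    is rounded up because a fraction of a machine cannot be provisioned.
    """
    if not 0 < current_pct <= target_pct <= 100:
        raise ValueError("expected 0 < current_pct <= target_pct <= 100")
    total_load = machine_count * current_pct
    return -(-total_load // target_pct)  # ceiling division on integers


# Figures quoted in the notes: 50,000 VMs at 20%, mainframes at 80%.
needed = machines_needed(50000, 20, 80)
idle_equivalent = 50000 - needed
```

Under these assumptions the same work fits on a quarter of the machines, which is the kind of gap that makes the shutdown argument attractive.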