How about sub-second queries in Hadoop?

Two observations from talking and listening to people during the Hadoop Summit. Firstly, Hadoop is used quite often to process clickstream data (in all fairness, I missed the talk about Hadoop used for genomics). Secondly, and as a corollary of the first, sub-second queries in Hive or Pig are not quite there yet. Since a Hive query translates into map and reduce tasks, the scheduling of those tasks, in addition to the sheer volume of data, determines response time. Pre-computing aggregates is a natural way to go, much like what is done for data warehouses. Where these aggregates should be stored for consumption is a problem that could lead to hybrid solutions: process the data with Hadoop, then export the results to Postgres or Infobright to enjoy a more mature (but less scalable) run-time environment. Get multi-terabyte daily processing and sub-second analytics, all of it open source. If you've done something like that, I'd be interested to know before I embark on a route where others have failed before.
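To make the hybrid idea concrete, here is a minimal sketch of the "aggregate in batch, serve from a relational store" pattern. Everything in it is an assumption for illustration: the sample clicks, the `daily_hits` table, and the function names are made up, the in-memory roll-up stands in for what would really be a Hive or Pig job, and SQLite stands in for Postgres or Infobright so that the example runs on its own.

```python
import sqlite3
from collections import Counter

# Hypothetical clickstream records as (day, page) pairs. In the real setup
# this roll-up would be the output of a Hadoop job, not in-memory Python.
clicks = [
    ("2000-01-01", "/home"),
    ("2000-01-01", "/home"),
    ("2000-01-01", "/search"),
    ("2000-01-02", "/home"),
]


def aggregate(records):
    """Batch step: count hits per (day, page)."""
    return Counter(records)


def export(conn, counts):
    """Load the pre-computed aggregates into a table keyed for fast lookups."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_hits ("
        "day TEXT, page TEXT, hits INTEGER, PRIMARY KEY (day, page))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_hits VALUES (?, ?, ?)",
        [(day, page, n) for (day, page), n in counts.items()],
    )
    conn.commit()


def hits_for(conn, day, page):
    """Serving step: a primary-key lookup instead of a scan over raw logs."""
    row = conn.execute(
        "SELECT hits FROM daily_hits WHERE day = ? AND page = ?",
        (day, page),
    ).fetchone()
    return row[0] if row else 0


# SQLite is only a stand-in for the relational store named in the post.
conn = sqlite3.connect(":memory:")
export(conn, aggregate(clicks))
print(hits_for(conn, "2000-01-01", "/home"))
```

The point of the split is that the expensive pass over the raw data happens once per batch, while each interactive query touches only a small indexed table.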

Video on Hadoop from Yahoo!

http://us.dl1.yimg.com/download.yahoo.com/dl/ydn/hadoop.m4v

Key points:
  1. CPU speed, RAM size, and I/O bandwidth increase exponentially; hard drive seek times do not.
  2. Relational databases and their B-tree data structures require ln(n) seeks, as a crude simplification.
  3. Sort/merge algorithms working on flat files operate as a function of the transfer rate or bandwidth, not seeks (ln(n) is mentioned, but I'd think it's at least n.ln(n)).
  4. Flat files allow data not to conform to a preconceived schema, which is good for exploration.
  5. Commodity PCs offer the best computational bang for your buck, but failures are bound to occur frequently; that's the Google model of scaling out.
Overall this reinforces the idea that MapReduce and its ilk are best-suited for loosely-structured data, or rather data which one does not know how to query and structure beforehand.
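
The seek-versus-transfer argument in points 1 to 3 can be checked with a back-of-the-envelope calculation. The sketch below is only that: the disk figures (10 ms per seek, 100 MB/s sequential transfer), the 100-byte record size, and the 1 TB dataset are assumptions I picked for illustration, not numbers taken from the video, and one seek per updated record is a deliberately crude model of a B-tree update.

```python
# Illustrative assumptions, not measurements and not figures from the video.
SEEK_S = 0.010          # seconds per random disk seek
TRANSFER_BPS = 100e6    # sequential transfer rate, bytes per second
RECORD_BYTES = 100      # size of one record
DATASET_BYTES = 1e12    # 1 TB of records


def seek_update_seconds(fraction, seeks_per_update=1):
    """Cost of updating a fraction of records in place, one seek at a time."""
    n_records = DATASET_BYTES / RECORD_BYTES
    return fraction * n_records * seeks_per_update * SEEK_S


def rewrite_seconds():
    """Cost of streaming the whole dataset: one read pass plus one write pass."""
    return 2 * DATASET_BYTES / TRANSFER_BPS


def breakeven_fraction():
    """Fraction of updated records above which a full rewrite is cheaper."""
    n_records = DATASET_BYTES / RECORD_BYTES
    return rewrite_seconds() / (n_records * SEEK_S)


print("update 1% via seeks (days):", seek_update_seconds(0.01) / 86400)
print("full sequential rewrite (hours):", rewrite_seconds() / 3600)
print("break-even fraction:", breakeven_fraction())
```

Under these assumptions, touching 1% of the records through random seeks takes roughly 11.6 days, while streaming the whole terabyte through takes under 6 hours, so the flat-file approach wins as soon as more than a tiny fraction of the data is touched. Different hardware shifts the break-even point but not the shape of the argument.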