How about sub-second queries in Hadoop?

Two observations from talking and listening to people during the Hadoop summit; firstly hadoop is used quite often to process clickstream data -- in all fairness I missed the talk about hadoop used for genomics. Secondly and a corollary of the first, sub-second queries in hive or pig are not quite there yet. Since a hive query translate into maps and reductions their scheduling determines in addition to the sheer volume of data is going to determine response time. Undoubtedly pre-computing aggregates is a natural way to go much like what is done for data warehouses. Where these aggregated should be stored for consumption is a problem that could to hybrid solutions. Process data with hadoop and export then to postgres or infobright to enjoy a more mature (but less scalable) run-time environment. Get multi-terabyte daily processing and sub-second analytics and all that open source. If you've done something like that, I'd be interested to know before I embark on a route where others have failed before.

twiki is great... twiki is not so great.

To organize our internal IT information we have been using twiki. It is a very flexible tool by virtue of being a wiki and has two critical features out of the box that other wikis seem to lack:
  1. Forms
  2. and a fairly interesting search directive (%SEARCH%)
If you are not using these with twiki you are missing out; the analogy is using Word without styles. You can do without but life is so much easier with them. Forms brings structure to wiki pages and allow to treat wiki pages as a structured record (the form) with a big, free-form description field (the page). For instance our twiki implementation records hosts, hardware items, services, change requests, incident tickets all with the use of custom forms, so as to produce a pseudo-relational database on which we build reports. Examples of reports:
  1. list of all change requests awaiting peer review before approval
  2. list of all hosts assigned to a given project
  3. list of all hosts running on a given piece of hardware
The list goes on. Then we start having questions such as "which are the hardware pieces whose leases end in the next 3 months?" or "how many hosts run RedHat 4.5?". And that is when twiki breaks... Its reliance on a file-based scheme (and rcs) to maintain relationship imposes some unwelcome limitations, not to mention a level of performance that is difficult to accept on a daily basis (I know that caching is in the works but it is just not built to scale). Case in point: we define hosts (think linux hosts) as compute resources that execute on some physical substrate (think IBM x3550) so it is only natural that the host form has a mention for the hardware item it executes on. In other words there is a one-to-n relationship between hardware item and host. On the hardware item form we do not feature the list of hosts that live on that hardware item because chances of dangling pointers are too great. We used to have it and quickly we ended up with hosts that point to a piece of hardware, which itself does not point back to these hosts. In other words we have had to limit the type of reports we can run because the underlying data implementation of twiki is lacking. Questions such as "Which hardware items are home to more than 3 hosts?" become unnecessarily complicated, whereas with the proper framework it becomes as simple as: select hi.name, count(*) from hardware_item hi join harware_host ho on (hi.sid = ho.hardware_sid) group by hi.name having count(*) >= 3 How about the list of potential single points of failure for a given service: select max(h.name) as hostname, hc.name as host_class from service s join service_host so on (s.sid = so.service_sid) join host h on (so.host_sid = h.sid) join host_class hc on (h.class_sid = hc.sid) where s.name = "My critical service" group by hc.name having count(*) < 2; Now, assuming I have such a relational database that twiki can query via a sql module, how different is it from the database that my monitoring package is based upon? In the ideal world my data model presents something that:
  • monitoring can use (service dependencies, host maps, etc.)
  • configuration management can use (change requests bound to given hosts, software items, etc.)
  • asset management can use
  • finance can use
The key properties that I would want such a system to keep are the ease of use with which it be manipulated (nothing more cumbersome that twiki) and its accuracy (no duplicate data). At the same time I have not found any product out there fits the bill (a monitoring package that has a solid data model that be extended for other uses). So I might just bite the bullet and build a prototypical ERP for IT. Stay tuned.

It's all about data

Most likely the service you are providing is heavily relying on persistent data, some (or most) user-generated, often quite structured so that relationships can be drawn between them in a fairly automated fashion. When the raw datum is not structured you can structure its meta-data. For example, consider the various multimedia repositories (flickr, youtube, dailymotion, etc.) The raw datum (photo, video) does not have much in the way of structure hence the emphasis its meta-data: tags, location, user ownership, etc. Isolating the right amount of structure to drive your data model is key to correctness, performance and scalability because it is really a translation of your analysis of the problem at hand. Fortunately that's really the hard part of the whole process. Once you have a data model that seems to fit your use cases, you can delegate most of the implementation details to a database. I am going to consider relational databases here because they cover a good number of use cases by virtue of being built on top of set theory. They require a discrete (and generally low) number of types in your model, each type being possibly represented by a large number of identifiable instances. To be sure some problems are not easily represented as relations (e.g. highly heterogeneous data, such as data commonly stored as files on a computer) or are not easy to optimize in the relational model (e.g. graphs). But the common stuff lends itself well to using relational databases. You will find that time spent in learning relational databases is time well invested. A relational database
  1. frees your mind from worrying too much about storage specifics
  2. lets you code your application in a portable way
  3. offers excellent performance
So if you find yourself implementing sorting, filtering of large data sets in your application code, you are not properly using your database, at least until you have reached a respectable volume of data (more on this in later posts). The typical misuse of the database involves something like this:
  1. load items in your business logic layer with minimal filtering, one by one
  2. for each entity, traverse the rest to recreate relationships
  3. discard all when finished
  4. observe rapidly degrading runtime performance as the number of items grows
Object-relational mappers (ORM) tend to favour this kind of excess, by re-inventing a watered-down, clumsy query language that is half as expressive as good ole' SQL. Except for the simplest queries of your data set I recommend going the other way round: start from your relational queries and build objects from them. Your throughput is going to scale much more easily with the number of items than if you recreate the same steps in your business logic code. Not too mention the correctness that a normalized model buys you for free, or the ability to operate on data in a transactional manner. There is a limit to this linear growth of course but until you outgrow fast and (reasonably) cheap machines such as the Sun x4600 M2, your performance will scale. Why is a relational database likely to scale better than your code? Because a large number of smart people have spent the last 30-some years optimizing the very same operations that you perform on your data (projections, filters, updates, creations). If the excellent SQL Cookbook from my friend Anthony does not convince you that SQL can be an elegant approach to data problems, I don't know what will.