#structure09 Hosting on commodity hardware

I just got out of the panel on commodity hardware and did not get a chance to participate so here's my take on it. The panel started with an opening question: google, amazon and the likes run at a huge scale on commodity hardware, yet enterprise vendors still push customized hardware and expensive at that. To me the answer is pretty obvious: enterprise hardware is being for the most part sold to people who don't know how to architect and design software on a commoditized stack. Let's be honest, look at most "enterprise" hardware/software literature: it's just noise and a waste of both the writer's and the reader's time. And by stack I mean from the server, all the way up to the application code. If you constrain yourself to buy servers that cost no more than $5k, buying high-end database software makes little sense. Rather you recognize that low-end compute is how you get economies of scale and you apply the same reasoning to your networking gear, storage systems, database software, load balancing software, etc. Google, from its earlier papers, seems to be the first to have understood that, rejecting the usual marketing garbage from large vendors. And for that we should be grateful.

How about sub-second queries in Hadoop?

Two observations from talking and listening to people during the Hadoop summit; firstly hadoop is used quite often to process clickstream data -- in all fairness I missed the talk about hadoop used for genomics. Secondly and a corollary of the first, sub-second queries in hive or pig are not quite there yet. Since a hive query translate into maps and reductions their scheduling determines in addition to the sheer volume of data is going to determine response time. Undoubtedly pre-computing aggregates is a natural way to go much like what is done for data warehouses. Where these aggregated should be stored for consumption is a problem that could to hybrid solutions. Process data with hadoop and export then to postgres or infobright to enjoy a more mature (but less scalable) run-time environment. Get multi-terabyte daily processing and sub-second analytics and all that open source. If you've done something like that, I'd be interested to know before I embark on a route where others have failed before.

Notes from the 2009 Hadoop Summit West

I just got back from Santa Clara where Yahoo and Cloudera were hosting the 2009 Hadoop Summit West on Wednesday followed by a training on Thursday. My interest was one of a prospective user -- to gauge how real and mature hadoop is. The turn-out was more than decent, in the hundreds; a number from Yahoo, running the largest clusters so far, a few folks from Amazon, Facebook, some local universities and a fair number of small companies that have deployed their own clusters (or are running on EC2). The good news first, hadoop is real and it's getting real use. It's clearly a promising platform with active use and development. The scaling model is fairly simple: buy more machines. The current sweet spot is dual-quad hosts with 4x1TB drives and 16GB or so of ECC RAM. Decoupling storage from a central system (à la SAN) is the way to go. Some folks have tried to hook up Thumpers to Niagara chips that run a lot of threads in parallel with some success but the TCO question is unclear. Hence we can start with a handful of cheap machines and go from there. A few things to watch for: the secondary name node for instance, is there here for backup but to persist the DFS layout structures that exist in RAM on the primary name node. It could have been implemented in a more robust fashion using a sql database rather than requiring a re-implementation of redo logs and data files. That's overall the negative point: applications built on the platform (such as hive, hbase and pig) are still pretty much works in progress, somewhat duplication functionality. There is an air of Not Invented Here that still pervades but it's a sign that the whole thing is still young. A vocal user base that meets regularly should help the project focus on the pieces that truly do not exist yet.

IPSec throughput tests between 2 Soekris 5501 running pfsense

Late last year I was playing with pfsense in order to replace ssh+vtund connections between sites with a cleaner ipsec rig. To that effect I set up 2 soekris 5501 with HiFn crypto accelerators, directly connected via a Cat-6 ethernet cable, both running pfsense-1.2 (I forget which release candidate) and was able to pipe 20Mb/s using 256 bit-AES ESP (note the little b as bit, not byte). I controlled for ethernet limitation by sending 8x-10x as much data over the same link without ipsec.

What's happened in the last four months

I have been busy building a colocated platform for a client of ours. Early June work starts putting together a brand new development environment, which we built with VMware ESX, a host of Dell R805s and 3PAR E200. Mid-August we were able to "move" in the "production" suite with racks and power. 3 months later, we are ready to launch the full platform so that's left me with little time for non-essential things such as posting here. Now that launch is almost there, I'll be happy to distill what I've learned in the process. Stay tuned.

On continuous production

Continuous production is an idea that's probably as old as the first modern blast furnaces, an idea that decades of industrialization have perfected, an idea that has become all the more current in internet operations now that the mom-and-pop web hosting has turned into industrial scale operations. The body of literature regarding the topic of running an online service continuously without eating away profit margins is getting richer by the day, with a crucial contribution from James Hamilton (pdf of LISA paper) summarizing a wealth of traits of online operations that manage to scale from a technical and, as importantly from an economic standpoint.The peculiar nature of software is so that there are actually two distinct types of production:
  1. Production of service/software from need to formal requirements to design, development and release, the end production is a set of components/artifacts that are finished enough to fulfill users' needs (a.k.a. software development)
  2. Production of service/software from these artifacts to an actual service (a.k.a. service production)
Software engineers and their managers typically care about the former while the latter has traditionally been the realm of system engineers and operations. These 2 crowds have quite different cultures, nigh antagonistic since developers are meant to create manageable change while operations is tasked with keeping the whole thing running around the clock. And any experienced person in operations will tell you that things break, in their vast majority, when they change, when they are brought out of steady state. The aforementioned paper offers a way to bridge that gap by bringing developers much closer to operations. That's Amazon's supposed motto: "You build it, you run it". This is a cultural change that has proven natural in a tiny structure where everyone does everything but is proving harder to keep as the company grows, especially when it has not been constantly pushed as the correct way of organizing work. It has been one of my ongoing projects to make sure that we close that precious feedback loop between developers and operations.

twiki is great... twiki is not so great.

To organize our internal IT information we have been using twiki. It is a very flexible tool by virtue of being a wiki and has two critical features out of the box that other wikis seem to lack:
  1. Forms
  2. and a fairly interesting search directive (%SEARCH%)
If you are not using these with twiki you are missing out; the analogy is using Word without styles. You can do without but life is so much easier with them. Forms brings structure to wiki pages and allow to treat wiki pages as a structured record (the form) with a big, free-form description field (the page). For instance our twiki implementation records hosts, hardware items, services, change requests, incident tickets all with the use of custom forms, so as to produce a pseudo-relational database on which we build reports. Examples of reports:
  1. list of all change requests awaiting peer review before approval
  2. list of all hosts assigned to a given project
  3. list of all hosts running on a given piece of hardware
The list goes on. Then we start having questions such as "which are the hardware pieces whose leases end in the next 3 months?" or "how many hosts run RedHat 4.5?". And that is when twiki breaks... Its reliance on a file-based scheme (and rcs) to maintain relationship imposes some unwelcome limitations, not to mention a level of performance that is difficult to accept on a daily basis (I know that caching is in the works but it is just not built to scale). Case in point: we define hosts (think linux hosts) as compute resources that execute on some physical substrate (think IBM x3550) so it is only natural that the host form has a mention for the hardware item it executes on. In other words there is a one-to-n relationship between hardware item and host. On the hardware item form we do not feature the list of hosts that live on that hardware item because chances of dangling pointers are too great. We used to have it and quickly we ended up with hosts that point to a piece of hardware, which itself does not point back to these hosts. In other words we have had to limit the type of reports we can run because the underlying data implementation of twiki is lacking. Questions such as "Which hardware items are home to more than 3 hosts?" become unnecessarily complicated, whereas with the proper framework it becomes as simple as: select hi.name, count(*) from hardware_item hi join harware_host ho on (hi.sid = ho.hardware_sid) group by hi.name having count(*) >= 3 How about the list of potential single points of failure for a given service: select max(h.name) as hostname, hc.name as host_class from service s join service_host so on (s.sid = so.service_sid) join host h on (so.host_sid = h.sid) join host_class hc on (h.class_sid = hc.sid) where s.name = "My critical service" group by hc.name having count(*) < 2; Now, assuming I have such a relational database that twiki can query via a sql module, how different is it from the database that my monitoring package is based upon? In the ideal world my data model presents something that:
  • monitoring can use (service dependencies, host maps, etc.)
  • configuration management can use (change requests bound to given hosts, software items, etc.)
  • asset management can use
  • finance can use
The key properties that I would want such a system to keep are the ease of use with which it be manipulated (nothing more cumbersome that twiki) and its accuracy (no duplicate data). At the same time I have not found any product out there fits the bill (a monitoring package that has a solid data model that be extended for other uses). So I might just bite the bullet and build a prototypical ERP for IT. Stay tuned.

Great presentation from Dan Pritchett at eBay on operational manageability

Here at infoq.Dan made a number of interesting points that resonate with our experience:

The need to figure out dependencies before crisis hits

The main point here is that good software design heavily promotes resource abstraction. Databases become simple data sources that can be used without worrying about their whereabouts. So this makes dependency mapping an intricate task if it is not conducted in parallel to the development process. We have all sifted through lines and lines of J2EE thread dumps (to cite an extreme) in a moment of crisis to end up looking at 5 different configuration files and pinpoint exactly which resource is failing. Right there that requires someone who is familiar with application server internals and that someone is usually someone from the application development team rather than a system administrator.

Power crunch

Software does not get efficient quarter after quarter, rendering efforts on the hardware side negligible.

An active/active disaster recovery setup is often preferable to active/passive

The initial set-up cost in complexity and capital expenses is often by two critical factors:
  1. active/active means that both sites are tested as opposed to a passive site, which you discover at the worst moment possible, won't be able to serve without configuration changes
  2. active/passive failover is typically disruptive (lost sessions, etc.) so the business impact is higher
Another nicety of an active/active set-up is that, as you multiply the number of sites, each one requires less excess capacity to sustain the loss of 1 site. Set up 3 sites and each needs to have 150% of nominal capacity to absorb the loss of one; 4 sites means 133%, etc.

Developers have 2 groups of customers, real customers and the ops team

That is the gist of the "Release It!" book mentioned in an earlier post.

Some figures about eBay's current infrastructure

They use 5000 application servers running on commodity hardware and about 300 database servers running on mid-range Sun hardware.