Interesting EC2 DNS bug

EC2's internal DNS servers don't get updated when you stop and restart EBS-backed instances. I came across this bug as I was trying to get the Scala offline compiler (fsc) to work on a restarted instance. fsc uses java.net.InetAddress.getLocalHost(), which triggers a DNS lookup. After some time spent reading the code, a tcpdump session convinced me that the machine thought it was something else (at least at the DNS level). Call it split personality. To reproduce:
  1. start an EBS-backed instance
  2. note its name and its internal IP (uname -n, ip addr)
  3. stop and restart the instance
  4. observe that its node name remains unchanged and its IP has changed, yet dig +short instance_dns_name still returns the old IP, even hours after the restart
Annoying!
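
The check in step 4 can be scripted. Here is a minimal Python sketch, not what fsc does internally: primary_ipv4 and check_dns are names I made up, and the 10.255.255.255 target is an arbitrary placeholder. Note that socket.gethostbyname goes through the system resolver, which may consult /etc/hosts before DNS, so it is not strictly equivalent to dig.

```python
import socket


def primary_ipv4():
    """Best-effort guess at the address used for outbound traffic.

    Connecting a UDP socket sends no packet; it only asks the kernel to
    pick a source address. The target below is an arbitrary placeholder.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("10.255.255.255", 1))
        return s.getsockname()[0]


def check_dns(hostname, local_ip, resolve=socket.gethostbyname):
    """Return (resolved_ip, matches) for hostname against the local address.

    `resolve` is injectable so the comparison can be exercised without
    touching the network.
    """
    resolved = resolve(hostname)
    return resolved, resolved == local_ip
```

On an affected instance, check_dns(socket.gethostname(), primary_ipv4()) should come back with the old address and False.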

#structure09 Hosting on commodity hardware

I just got out of the panel on commodity hardware and did not get a chance to participate, so here's my take on it. The panel started with an opening question: Google, Amazon and the like run at a huge scale on commodity hardware, yet enterprise vendors still push customized hardware, and expensive hardware at that. To me the answer is pretty obvious: enterprise hardware is for the most part sold to people who don't know how to architect and design software on a commoditized stack. And by stack I mean everything from the server all the way up to the application code. Let's be honest, look at most "enterprise" hardware/software literature: it's just noise and a waste of both the writer's and the reader's time. If you constrain yourself to buying servers that cost no more than $5k, buying high-end database software makes little sense. Rather, you recognize that low-end compute is how you get economies of scale, and you apply the same reasoning to your networking gear, storage systems, database software, load balancing software, etc. Google, judging from its earlier papers, seems to have been the first to understand that, rejecting the usual marketing garbage from large vendors. And for that we should be grateful.

Fun but not practical: cloud computing steganography

Differential Power Analysis is a neat way to cryptanalyze smart cards and that triggered an interesting counter-measure: keeping power consumption constant regardless of the computation performed. Moving to a bigger scale and assuming low cloud compute costs, one could hide sensitive data processing in one VM by running ninety-nine others with slightly different data, whose results will be discarded silently.
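
As a toy illustration of the idea, here is a minimal Python sketch. Everything in it is hypothetical: run_with_decoys, job and make_decoy are names I made up, and the jobs run in a local loop where the real thing would dispatch each input to its own VM.

```python
import random


def run_with_decoys(job, real_input, make_decoy, n_decoys=99, rng=None):
    """Run `job` on the real input hidden among n_decoys look-alike inputs.

    Only the caller knows which slot holds the real input; every result
    except that one is silently discarded.
    """
    rng = rng or random.Random()
    inputs = [make_decoy(real_input, rng) for _ in range(n_decoys)]
    real_index = rng.randrange(n_decoys + 1)   # secret position of the real job
    inputs.insert(real_index, real_input)
    results = [job(x) for x in inputs]         # in practice: one VM per input
    return results[real_index]
```

For instance, run_with_decoys(lambda x: x * 2, 21, lambda x, rng: x + rng.randint(1, 5)) returns 42 after doing a hundred times the work, which is exactly why this is fun but not practical.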

#Cloudcamp: storing sensitive data in a public cloud

Yesterday at CloudCamp, a few of us discussed methods to store and use sensitive data on a public cloud, where you presumably do not have strong assurances that your data are for your eyes only. For structured data (e.g. relational data), and assuming your data have an easy key, a pattern emerged among participants:
  1. Store all non-identifiable data in the cloud, keyed by an arbitrary identifier.
  2. Keep actual identifiable data on-premises with the mapping to that arbitrary identifier.
  3. Let the client device resolve the mapping locally.
For instance, suppose you store transactions: this would require at a minimum the scrambling of transaction details such as item name and party name. That offers some mitigation against simple analysis of the data, though statistical methods can still derive information from what is left, which can be an acceptable risk. Of course everyone remembers the AOL search term fiasco, where people could be identified based on their search terms. Which is why this scheme should work best if the data are highly structured.
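
The pattern can be sketched in a few lines of Python. This is a minimal sketch under my own assumptions, not something the participants wrote down: OnPremMapping, to_cloud, from_cloud and the field names party and item are all hypothetical, and a real mapping table would need persistence and access control.

```python
import uuid


class OnPremMapping:
    """Two-way table between real values and arbitrary identifiers.

    This object never leaves the premises.
    """

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def token_for(self, value):
        if value not in self._token_by_value:
            token = uuid.uuid4().hex   # random, carries no information about value
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def value_for(self, token):
        return self._value_by_token[token]


SENSITIVE_FIELDS = ("party", "item")   # hypothetical field names


def to_cloud(txn, mapping):
    """Replace sensitive fields with tokens before the record leaves the premises."""
    return {k: mapping.token_for(v) if k in SENSITIVE_FIELDS else v
            for k, v in txn.items()}


def from_cloud(record, mapping):
    """Resolve tokens back to real values on the client side."""
    return {k: mapping.value_for(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}
```

Note that a given value always maps to the same token, so frequencies and amounts remain visible in the cloud copy; that is the statistical-analysis exposure mentioned above.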

Open cloud manifesto, not much radicalism here

The manifesto triggered copious traffic thanks to the smoke-filled-backroom air of its inception. I wish the same could be said about its less-than-radical contents. If you were expecting a stated vision of the cloud as the substrate of all future computing that's not a mobile phone or nettop, there is no such thing there. It sounds more like the cries of small players about to be crushed by the non-signatory parties, i.e. Amazon, Google and Microsoft.

"The illusion of unlimited supply..."

Berkeley's Reliable, Adaptable, Distributed Systems Lab has produced a nice synthesis of the current technological underpinning of cloud computing, in a paper called "Above the clouds". StorageMojo and Perspectives have done a fine exegesis of the paper so I thought I'd focus on a claim that has caught my attention.

The claim
That claim is the fundamental premise of perceived, unlimited compute and storage elasticity: "The illusion of infinite computing resources available on demand, thereby eliminating the need for Cloud Computing users to plan far ahead for provisioning". I would argue that this illusion has to be dispelled for a stable and long-term development of the clouds. The market being in its infancy, demand is still fairly limited (Amazon claims 400,000 registered AWS customers), so as a customer I can operate on the assumption that any individual demand does not significantly affect supply. To make sure that individual demand does not rock the boat, a provider such as Amazon rations the amount of resources available to any customer, so that fluctuations in demand can be absorbed by its resource allocator. Going beyond that limit requires entering a longer discussion with the provider, so that it can ensure that its resource allocator will handle the peaks and troughs in demand. This is a classic production/price control scheme.

Amazon Reserved instances
Recently Amazon has introduced the concept of “reserved instances”; a one-time payment per instance allows for lower per-hour charges. Presumably that one-time payment, while still profitable, allows the provider to better predict future sustained demand. Firstly it is only worth getting a reserved instance if you plan to use it:
  • more than 193 straight days for a year or,
  • more than 99 days per year for 3 years.
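
These thresholds can be reproduced with a bit of arithmetic. The Python sketch below assumes prices that are not stated in this post: what I recall as the list prices for a small Linux instance at the time ($0.10/hour on demand; $325 upfront for one year or $500 upfront for three years, then $0.03/hour). Treat them as assumptions and plug in the current price list.

```python
HOURS_PER_DAY = 24

# Assumed prices (not stated in the post), in USD.
ON_DEMAND_HOURLY = 0.10
RESERVED_HOURLY = 0.03
ONE_YEAR_FEE = 325.0
THREE_YEAR_FEE = 500.0


def break_even_days(upfront_fee, on_demand_hourly, reserved_hourly):
    """Days of run-time after which the hourly savings repay the upfront fee."""
    hours = upfront_fee / (on_demand_hourly - reserved_hourly)
    return hours / HOURS_PER_DAY


# One-year contract: days of use within the year needed to break even.
one_year_days = break_even_days(ONE_YEAR_FEE, ON_DEMAND_HOURLY, RESERVED_HOURLY)

# Three-year contract: break-even spread evenly over the three years.
three_year_days_per_year = break_even_days(
    THREE_YEAR_FEE, ON_DEMAND_HOURLY, RESERVED_HOURLY) / 3

# Upfront money saved by one 3-year contract versus three 1-year contracts,
# expressed as days of run-time at the reserved hourly rate.
free_days = (3 * ONE_YEAR_FEE - THREE_YEAR_FEE) / RESERVED_HOURLY / HOURS_PER_DAY

print(round(one_year_days), round(three_year_days_per_year), round(free_days))
```

Under those assumed prices the three values round to 193, 99 and 660 days.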
193 straight days is really about 12 hours a day every day of the year if your business is clustered around a few timezones. A three-year contract gives you 660 days of free run-time compared to signing up for 3 consecutive one-year contracts. It’s a clever move. Let the customer pre-pay all or part of the fully-burdened, marginal cost of an instance, yet retain a variable part so that it does not turn into an all-you-can-eat feast that would devour margins. Also consider that in the Amazon Web Services (tm) Customer Agreement, “reserved instances” are really about “reserved instance pricing”, not about Amazon reserving capacity to serve these instances, their Service Level Agreement notwithstanding. This should allow Amazon to achieve at least 3 objectives:
  1. better plan for capacity, with at least part of the marginal cost of instances pre-paid,
  2. give corporate customers a greater perception of control and security, however tenuous it is in reality,
  3. differentiate corporate customers (usage patterns, utilization) and tweak the resource allocator to that purpose.
As a side note, I have not found a clause restricting the reselling of reserved instances, though margins would be low and eaten away by the payment and billing system used by the reseller (e.g. Amazon FPS).

Resource allocation
For the provider this is an interesting and crucial problem to solve: that of properly oversubscribing their actual physical resources (physical CPUs, physical drives, physical network pipes). To oversubscribe physical resources through virtualization at a large scale is, after all, what the cloud is about. It's also the name of the game in traditional banking: lend 9 times more money than you have in trusted assets. So as a provider, using the circulated figure of a "best-case" 30% compute utilization, with 100 physical compute units I should be able to lease at most 300 units. This is of course simplistic and there is a wealth of literature on the topic of oversubscribing commodities, be it money, phone minutes, bandwidth or plane seats. All providers being for-profit enterprises, we can foresee that they are going to drive their oversubscription to the limit, and the winners are likely going to be determined by 2 factors:
  • the marginal cost of physical resources (compute, storage, network) and management overhead for each,
  • the sophistication of the resource allocator.
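
To get a feel for the oversubscription example (100 physical units, 30% utilization), here is a toy Monte Carlo sketch in Python. It rests on an assumption the discussion above does not make: each leased unit is busy independently with probability equal to the utilization figure. Real demand is correlated, which is precisely what a resource allocator has to deal with. overload_probability is a made-up name.

```python
import random


def overload_probability(physical_units, leased_units, utilization,
                         trials=2000, rng=None):
    """Estimate how often simultaneous demand exceeds physical capacity.

    Assumes every leased unit is busy independently with probability
    `utilization`; this is a simplification, not a model of real demand.
    """
    rng = rng or random.Random(0)
    overloaded = 0
    for _ in range(trials):
        busy = sum(1 for _ in range(leased_units) if rng.random() < utilization)
        if busy > physical_units:
            overloaded += 1
    return overloaded / trials


if __name__ == "__main__":
    for leased in (100, 200, 300, 400):
        print(leased, overload_probability(100, leased, 0.30))
```

Even in this idealized model the provider runs out of capacity in a small fraction of the trials at 300 leased units, since average demand there is 90 of the 100 physical units, and almost always at 400.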
The first factor, the marginal cost of physical resources, varies inversely with scale, as the authors of the paper point out. The bigger you get, the cheaper it becomes to operate your data centers; the figures currently documented point to a 5x-7x reduction in cost when you reach internet scale. Clearly this is not going to leave a lot of breathing room for the mom-and-pop operations. The second factor is, by and large, independent of scale. How high you can drive your data center utilization as a cloud computing provider, without running out of resources, is going to make a significant difference once you've reached internet scale.

The crystal ball
Since this market is still young and because the barrier to entry to reach a profitable zone is high, I suspect that the smaller players that do not offer a lot of added value on top of computing infrastructure leasing will have to drive their resource allocator very aggressively; too aggressively, which means that they’ll end up losing customers because in the end oversubscription will reach unsustainable levels. Once the small operators are out of the picture, the 3 major players will either cartelize the market or compete on price by deploying a better resource allocator. In the former case we would be looking at a situation analogous to the oil market or the U.S. ISP market: the illusion of competition shrouding an illicit price fixing agreement. In the latter case, assuming a greater shift to cloud computing, the best resource allocator will have to buffer demand spikes and maintain a high utilization. Make switching from one provider to another swift, public and reversible, and you have a public market where prices are driven not only by supply and demand (who in 2009 can still believe that?) but also by trust and perception. Without more transparency to let the public get a sense of inventories (e.g. through a clearinghouse), who’s to say we won’t see another bubble?

Structure08, my impressions so far

It's a day packed with keynotes, panels and schmoozing, with some topics overlapping with Velocity, yet at a much higher level. The panels have alternated in quality. Among the interesting ones was "Harnessing explosive growth", where the key points are:
  1. a proper architecture lets you scale [much like in traditional building]
  2. build kill switches into all your features
  3. get operations and development into a symbiotic relationship [Salesforce and Amazon do it]
Some other panels are clearly more about pushing a product ("The race to the next database"). The topic of processing data (possibly in the cloud) is of course crucial, yet very few concerns around switching costs, security and privacy are addressed. My take on this is that if you need to run analytics on your data sets and said data sets are huge, you need compute to be close (from a network distance perspective) to your data. Which means that your data must be in the cloud. While I'm reluctant to go down that route right now, Greg Papadopoulos @sun made the compelling analogy that money storage is delegated to reputable third parties (called banks), so data are likely to follow the same treatment, i.e. the cloud is likely to become the most secure place to store data (or the most resilient with an acceptable security policy). Sun's interesting take on cloud computing is Project Caroline, where infrastructure, including network bits, is driven by code, in a way that's presumably a bit cleaner than EC2 (which is quite bare). Dr. Vogels's presentation @amzn was inspiring despite containing little new information, and it fits well into this type of conference, which acts as a reinforcement device to jumpstart a new industry. Live coverage is at GigaOM.