Velocity: John Allspaw @flickr, Capacity Planning

What can cause downtime:
  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems
Deployment and management tricks from the HPC world: ganglia, System Imager.

Gather metrics of course, and build models, ideally out of live data rather than artificial benchmarks. fityk can replace Excel for curve fitting. [My guess is that R would work great for that too]

Some flickr stats: 12,629 nagios checks, 1314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day. [So flickr uses nagios + ganglia]

One key trick is to build kill switches into all the features, so that things can be turned off when load increases.
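As a rough illustration of the curve-fitting step, here is a minimal sketch that fits a straight line to disk usage and estimates the remaining headroom. The measurements, the 200 TB capacity and the function names are all made up for the example (they are not flickr data), and a straight line is only an assumption; real growth may call for another curve.

```python
# Hypothetical daily measurements: day index and disk space consumed (TB).
# These numbers are invented for illustration; they are not flickr data.
days = [0, 1, 2, 3, 4, 5, 6]
used_tb = [100.0, 104.1, 107.9, 112.2, 116.0, 119.8, 124.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(slope, intercept, capacity_tb, today):
    """Days from `today` until the fitted line reaches capacity_tb."""
    if slope <= 0:
        return None  # usage is flat or shrinking, so the line never crosses
    return (capacity_tb - intercept) / slope - today


slope, intercept = fit_line(days, used_tb)
# capacity_tb=200.0 is a hypothetical limit chosen for the example.
remaining = days_until_full(slope, intercept, capacity_tb=200.0, today=days[-1])
print(f"growth: {slope:.2f} TB/day, about {remaining:.0f} days of headroom")
```

Feeding the fit with live measurements, as the talk recommends, keeps the forecast honest as the growth rate changes.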

Velocity: Adam Bechtel @yahoo, Performance plumbing

When building a global network, you start building out knobs (usually implemented as routing policies): cost, packet loss, latency, maintenance, diversity, isolation, "special". [Really funny analogy between anycast and toilets, caching and water supply]

After having developed routing policies, you start looking into anycast. One of the first services to be anycast is DNS.

Anycast scaling: VIP, ECMP.

Anycast considerations: how to monitor services? how to control users? how to handle transient network events?
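To make the ECMP part concrete: routers commonly hash packet header fields so that every packet of a flow takes the same path. The sketch below is only an illustration of that idea, not something shown in the talk. The server names, the addresses (documentation ranges) and the use of SHA-256 are assumptions; real routers use much cheaper hardware hashes.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick one equal-cost next hop for a flow by hashing its 5-tuple."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap around
    # the number of available next hops.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical servers announcing the same anycast VIP.
next_hops = ["server-a", "server-b", "server-c"]

# (source IP, destination IP, protocol, source port, destination port)
flow = ("198.51.100.7", "192.0.2.53", "udp", 40123, 53)

# The same flow always maps to the same server while the set is stable.
print(pick_next_hop(flow, next_hops))

# When a server disappears the modulus changes, so a flow may be remapped.
print(pick_next_hop(flow, next_hops[:2]))
```

The second call hints at why transient network events are on the list of considerations: with this simple modulo scheme, a change in the set of next hops can move existing flows to a different server.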

Velocity: Panel, a survival guide

Panelists: presented by Adam Jacob (HJK Solutions), Shayan Zadeh (Zoosk, Inc.), Brian Moon (dealnews.com), Don MacAskill (SmugMug), John Allspaw (Flickr (Yahoo!)), Michael Halligan (BitPusher, LLC) and a gentleman (Fotolog).

Don MacAskill: Rafael Nadal started to win Roland-Garros and his fan club was there. He won the Open, which created a huge spike. Comments had to be turned off for the site to survive. The next year, he won again and stats had to be turned off. For his third victory, the servers did not collapse. This year he won and we did not even register.

John Allspaw: code gets pushed 20 to 30 times a day... Major events triggered traffic spikes.

Don would love to not operate a data center anymore, despite their expertise.

John: DB problems are hard [everyone in agreement, myself included]

[Discussion follows on scalability: do not optimize for scale too early]

Don: EC2 is not worth it for servers that run around the clock, but it can be if you're good at shutting down instances that you don't need.

Velocity: Sean Quinlan @google, Storage at scale

Strategy: buy lots of commodity hardware, because problems tend to be too big for their problem space. Hardware reliability is not that useful either, because it's expensive.

[Showing the same pictures over and over again, someone from Google PR, please authorize the release of newer pictures]

[A GFS description follows, nothing new so far, read the papers on the topic]

[A BigTable description follows, same deal]

I wish this talk had some new information...

Velocity: Rich Wolski @ucsb, EUCALYPTUS

Eucalyptus is an open-source implementation (not production-ready) of a compute cloud that is API-compatible with EC2. In academia, sysadmin time is very expensive, so the roll-out has to be really simple. Eucalyptus currently uses Xen and includes a security layer that replaces Amazon's use of the credit card authentication/authorization system. Mention of ROCKS, a cluster deployment system.

Velocity: Brent Chapman @great circle, what can IT professionals learn from emergency services?

Example: a car hits a fire hydrant. Lots of agencies are involved (fire department, ambulance, police, electrical company). How do they coördinate all that? Incident Command System is the protocol used in pretty much all emergency situations (courses available here). I'll put a pointer to the slides; the example used in the talk is good. The Wikipedia article is supposedly good and this article from ham radio operators is a good introduction.

Velocity: Luiz Barroso @google, efficient energy ops

Hypothetical energy cost extrapolations: 5 years from now, hardware could be only 20-50% of the total energy costs. Efficiency is defined as computing speed divided by power. It can be broken down further: (computing speed / power provided to chip) x (power provided to chip / power provided to server) x (power provided to server / power provided to data center).
  • Data center efficiency, PUE around 1.83, worse if data center is underutilized
  • Server energy efficiency, 25% dissipated by power supply
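A small worked example of how these ratios multiply, using the two figures above. PUE is total facility power divided by IT equipment power, so 1/PUE is the share that reaches the servers. Treating the 25% power-supply loss as the only server-level loss is a simplifying assumption, and the function name is made up for this sketch.

```python
def delivered_fraction(pue, psu_loss):
    """Fraction of data center power left after the server power supply.

    pue:      total facility power / IT equipment power
    psu_loss: fraction of server input power dissipated by the power supply
    """
    reaches_servers = 1.0 / pue       # data center level
    survives_psu = 1.0 - psu_loss     # server level (power supply only)
    return reaches_servers * survives_psu


fraction = delivered_fraction(pue=1.83, psu_loss=0.25)
print(f"about {fraction:.0%} of facility power gets past the power supply")
```

With those two numbers, only about 41% of the power entering the data center makes it past the server power supply, before counting any losses further down the chain.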
From the Uptime Institute: 10-year energy costs are $9/W for consumption and $10-22/W for data center build-out. Rough cost breakdown: 50% on hardware, 22% on energy, 28% on data center (assumptions: dual-socket x86, 4-year depreciation, 70% load at peak).
How to be more efficient:
  1. consolidate workloads
  2. measure actual power usage rather than rely on nameplates
  3. investigate oversubscription
Oversubscription potential rises as the number of machines grows, so oversubscribe at the data center level. Also mix workloads and be ready to kill instances if you get close to the limit.

Source: Energy-proportional computing.

Consider a data center as a device (5,000 machines): the utilization distribution has 2 peaks, one at 5% utilization, another around 30%. For a typical server, a machine running at a load of 0.3 is at 60% power efficiency, while a fully loaded machine is at 100% power efficiency, and sadly data centers are very rarely at 100%, as seen before.

The idea behind energy-proportional computing: a generally proportional relation between work and power. Idleness in a server is scarce. It should happen at the electronics because in software it's much harder (think of the kernel getting interrupts all the time).

If you break down power by component, you find out that the CPU is much more proportional than the rest of the components, so even when powering down the CPU the total savings are still between 10% and 20% of power. Still, CPUs have 2 important power-usage features:
  1. wide dynamic power range (RAM, disks and network devices stay within a much narrower power range)
  2. active low-power modes, where the CPU can still do things
People, who average around 120W, have a 20x dynamic power range, compared to 2x for a PC. In conclusion: write fast code (biggest contribution to energy efficiency), consider reducing all energy-related costs (provisioning), and demand energy-proportionality from equipment manufacturers.

Plug: http://climatesaverscomputing.org
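The energy-proportionality argument can be illustrated with a toy model. The sketch below assumes that power grows linearly from idle to peak and that "dynamic range" means peak power divided by idle power; both are assumptions made for the example, and the function names are hypothetical. It does not try to reproduce the 60% figure quoted in the talk, which comes from measured servers.

```python
def idle_fraction(dynamic_range):
    """Idle power as a fraction of peak, if dynamic range = peak / idle."""
    return 1.0 / dynamic_range


def relative_power(utilization, idle):
    """Power draw as a fraction of peak, under a linear model (assumption)."""
    return idle + (1.0 - idle) * utilization


def relative_efficiency(utilization, idle):
    """Work per watt relative to a fully loaded machine (utilization > 0)."""
    return utilization / relative_power(utilization, idle)


# 2x is the PC figure and 20x the human figure from the talk.
for label, dynamic_range in (("2x range", 2.0), ("20x range", 20.0)):
    idle = idle_fraction(dynamic_range)
    for utilization in (0.05, 0.3, 1.0):
        eff = relative_efficiency(utilization, idle)
        print(f"{label}, load {utilization:.2f}: {eff:.0%} of peak efficiency")
```

Under these assumptions, a machine with a 2x range running at a load of 0.3 reaches roughly 46% of its peak efficiency, while one with a 20x range reaches roughly 90%, which is the case for demanding a wider dynamic range from manufacturers.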

Velocity: Artur Bergman

Artur works for Wikia. WoWWikia is the 2nd largest wiki around. Value of performance and reliability is around WoW: $520 MM of profit per year, 99% reliable but users expect it, so it's really about setting expectations.

Operations is about using resources efficiently and reliably, and has to be measured against revenue from users and the value of downtime (which must be computed): e.g. cost per page served is vital to guide decisions.

Example from Wikia: 20% of all wiki pages went up from 200ms to 15s to load, 35% of pages were slow [per session] but that led to a 15% reduction of "fast" pages viewed, which has a clear cost. Launched a project with 3 engineers for 4 weeks to improve page performance. It yielded good results, but the ads network is slowing down the whole thing. Since ads use document.write, the wiki overrides it to allow pages to load without waiting for ads to finish loading. This led to more pageviews, but about 20% of ads are not even loaded (network time-out, user clicks away).
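The "value of downtime" can be computed with simple arithmetic. The sketch below plugs in the WoW figures quoted above and assumes, purely for illustration, that value accrues evenly over the year and is entirely lost while the service is down. Since the notes say users expect the downtime, the real loss may well be lower. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_cost_per_year(annual_value, availability):
    """Value lost to downtime, assuming value accrues evenly over the year."""
    return annual_value * (1.0 - availability)


annual_profit = 520e6   # $520 MM per year, from the talk
availability = 0.99     # "99% reliable", from the talk

lost = downtime_cost_per_year(annual_profit, availability)
hours_down = HOURS_PER_YEAR * (1.0 - availability)

print(f"downtime: {hours_down:.1f} hours/year")
print(f"value at stake: ${lost / 1e6:.1f} MM/year")
```

Under those assumptions, 99% availability is about 87.6 hours of downtime a year and about $5.2 MM of value at stake, which gives a ceiling for what an extra "nine" could be worth.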

Velocity: Jiffy, open-source performance measurement

Scott Ruthfield, Whitepages.com, a people search company with 2 bn searches per year, 500 requests/s at peak. Initial analysis: 8s to return results, sub-second to actually get the data. What's the source of the slow-down? Possible candidates:
  1. Ads
  2. Microsoft Virtual Earth
  3. Content generated from the results
Toolset (Gomez networks) is not good enough because of poor sampling (20 samples per hour, compared to 1.3 MM requests) [should quantify error margin here, presumably high assuming a normal distribution]

Introducing Jiffy. Objectives: measure anything, with little impact on page performance. Architecture starts with a jiffy.js that generates logs, which are then loaded into a DB and rolled up. Basic tenet: mark and measure. One mark, multiple measures. Miscellaneous features: immediate or batch submits (to not overload the measurement system), default browser event measurements (onload, etc.)

Bill Scott @netflix put together a firebug plug-in to capture client-side data.

Source: code.whitepages.com
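To illustrate the "one mark, multiple measures" tenet, here is a conceptual sketch in Python. It is not Jiffy's API (Jiffy itself is a JavaScript library); the class name, method names and event names are all hypothetical, and the batch is simply returned rather than submitted anywhere.

```python
import time


class Marks:
    """Toy mark-and-measure timer; hypothetical names, not Jiffy's API."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._marks = {}     # mark name -> time the mark was set
        self.measures = []   # (mark name, measure name, elapsed seconds)

    def mark(self, name):
        """Record a starting point that later measures refer to."""
        self._marks[name] = self._clock()

    def measure(self, measure_name, mark_name):
        """Record the time elapsed since the named mark."""
        elapsed = self._clock() - self._marks[mark_name]
        self.measures.append((mark_name, measure_name, elapsed))
        return elapsed

    def flush(self):
        """Return and clear the pending batch, e.g. to submit it in one go."""
        batch, self.measures = self.measures, []
        return batch


timer = Marks()
timer.mark("search")                          # one mark...
timer.measure("results_rendered", "search")   # ...multiple measures
timer.measure("ads_loaded", "search")
print(timer.flush())
```

Collecting measures and flushing them as a batch mirrors the batch-submit feature mentioned above, which keeps the measurement system from being hit once per event.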