Velocity: John Allspaw @flickr, Capacity Planning

What can cause downtime:
  1. bugs
  2. edge cases
  3. security incidents
  4. real capacity problems
Deployment and management tricks from the HPC world: Ganglia, SystemImager.

Gather metrics, of course, and build models, ideally from live data rather than artificial benchmarks. fityk can replace Excel for curve fitting. [My guess is that R would work great for that too]

Some Flickr stats: 12,629 Nagios checks, 1,314 hosts, 6 data centers, 4 photo farms, 3.5-4.5 TB consumed per day. [So Flickr uses Nagios + Ganglia]

One key trick is to build kill switches into every feature, so that features can be turned off when load increases.
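
To make the kill-switch idea concrete, here is a minimal sketch in Python. It is not how Flickr implements it: the `KillSwitches` class, the `shed_load` helper, the feature names and the load thresholds are all hypothetical, chosen only to show the shape of the technique (every optional feature checks a flag before doing work, and operations can flip the flag when load climbs).

```python
import threading


class KillSwitches:
    """In-memory registry of feature flags (hypothetical, for illustration)."""

    def __init__(self, features):
        # Every known feature starts enabled.
        self._enabled = {name: True for name in features}
        self._lock = threading.Lock()

    def disable(self, name):
        with self._lock:
            self._enabled[name] = False

    def enable(self, name):
        with self._lock:
            self._enabled[name] = True

    def is_enabled(self, name):
        with self._lock:
            # Unknown features are treated as off, the safe default under load.
            return self._enabled.get(name, False)


def shed_load(switches, load, thresholds):
    """Disable each still-enabled feature whose threshold the load has reached.

    `load` and the threshold values are assumed to be on the same scale,
    e.g. a 0.0-1.0 utilisation figure taken from monitoring.
    """
    disabled = []
    for name, limit in thresholds.items():
        if load >= limit and switches.is_enabled(name):
            switches.disable(name)
            disabled.append(name)
    return disabled


def render_photo_page(switches, photo_id):
    """Build a page, skipping any optional part whose switch is off."""
    parts = [f"photo {photo_id}"]
    if switches.is_enabled("comments"):
        parts.append("comments")
    if switches.is_enabled("stats"):
        parts.append("stats")
    return parts
```

With `switches = KillSwitches(["comments", "stats"])`, calling `shed_load(switches, 0.85, {"stats": 0.8, "comments": 0.95})` turns off only the stats, and the page keeps rendering with comments. In a real deployment the flags would presumably live somewhere shared (a config service or database) rather than in process memory.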
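
The advice to build models from live metrics can also be sketched in a few lines. The example below is an assumption-laden stand-in for what a tool like fityk does: it fits only a straight line by ordinary least squares, whereas real growth may need a different curve. The function names (`fit_line`, `days_until_full`) and the sample numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(days, used_tb, capacity_tb):
    """Estimate days remaining after the last sample until storage is full.

    Returns None when usage is flat or shrinking, since a linear model
    then never reaches the capacity limit.
    """
    slope, intercept = fit_line(days, used_tb)
    if slope <= 0:
        return None
    day_full = (capacity_tb - intercept) / slope
    return day_full - days[-1]


# Hypothetical daily disk-usage samples (TB), growing by 4 TB per day.
days = [0, 1, 2, 3]
used_tb = [100, 104, 108, 112]
remaining = days_until_full(days, used_tb, capacity_tb=200)
```

The point of fitting against measured usage rather than a benchmark is that the slope reflects what users actually do, so the forecast moves when their behaviour does.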

Velocity: Panel, a survival guide

Panelists: Adam Jacob (HJK Solutions), Shayan Zadeh (Zoosk, Inc.), Brian Moon (dealnews.com), Don MacAskill (SmugMug), John Allspaw (Flickr (Yahoo!)), Michael Halligan (BitPusher, LLC) and a gentleman from Fotolog.

Don MacAskill: Rafael Nadal started winning Roland-Garros, and his fan club was there. He won the Open, which created a huge traffic spike. Comments had to be turned off for the site to survive. The next year he won again, and stats had to be turned off. For his third victory the servers did not collapse. This year he won and it did not even register.

John Allspaw: code gets pushed 20 to 30 times a day... Major events triggered traffic spikes.

Don would love not to operate a data center anymore, despite their expertise at it.

John: DB problems are hard. [everyone in agreement, myself included]

[Discussion follows on scalability: do not optimize for scale too early]

Don: EC2 is not worth it for servers that run around the clock, but it can be if you're good at shutting down instances that you don't need.