When building a global network, you start building out knobs (usually implemented as routing policies):
cost, packet loss, latency, maintenance, diversity, isolation, "special"
[Really funny analogy between anycast and toilets, caching and water supply]
After having developed routing policies, you start looking into anycast. One of the first services to be anycast is DNS.
Anycast scaling: virtual IPs (VIPs), equal-cost multipath (ECMP)
Anycast considerations: how to monitor services? how to control users? how to handle transient network events?
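To make the ECMP point concrete, here is a minimal sketch of hash-based next-hop selection, the usual idea behind ECMP: every packet of a flow hashes to the same path, so a connection keeps landing on the same server behind the VIP. The function name, the addresses and the choice of SHA-256 are assumptions for illustration only; real routers use their own vendor-specific hash functions.

```python
import hashlib

def ecmp_next_hop(flow, next_hops):
    """Pick a next hop by hashing the flow 5-tuple (illustrative only)."""
    # Hypothetical key format: the five fields joined with '|'.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Modulo over the number of equal-cost paths.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

# Hypothetical servers, all announcing the same anycast VIP.
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Hypothetical flow: (src IP, src port, dst IP, dst port, protocol).
flow = ("198.51.100.7", 53124, "192.0.2.53", 53, "udp")

chosen = ecmp_next_hop(flow, servers)
```

Note that with plain modulo hashing, adding or removing a server changes the mapping for many existing flows, which is one reason transient network events are listed as a consideration.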
Panelists: Adam Jacob (HJK Solutions), Shayan Zadeh (Zoosk, Inc.), Brian Moon (dealnews.com), Don MacAskill (SmugMug), John Allspaw (Flickr (Yahoo!)), Michael Halligan (BitPusher, LLC) and a gentleman from Fotolog
Don MacAskill: Rafael Nadal started to win Roland-Garros and his fan club was on the site. He won the Open, which created a huge traffic spike. Comments had to be turned off for the site to survive. The next year, he won again and stats had to be turned off. For his third victory, the servers did not collapse. This year he won and the spike did not even register.
John Allspaw: code gets pushed 20 to 30 times a day... Major events triggered traffic spikes.
Don would love to not operate a data center anymore, despite their expertise.
John: DB problems are hard [everyone in agreement, myself included]
[Discussion follows on scalability: do not optimize for scale too early]
Don: EC2 is not worth it for servers that run around the clock, but it can be if you're good at shutting down instances that you don't need.
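A rough way to see Don's point is to compute the utilization at which paying by the hour costs the same as owning a server. This is only a sketch: the prices below are made-up placeholders, not real EC2 or hardware costs, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def on_demand_monthly_cost(hourly_rate, utilization):
    """Monthly cost of an hourly-billed instance running a fraction of the time."""
    return hourly_rate * HOURS_PER_MONTH * utilization

def break_even_utilization(owned_monthly_cost, hourly_rate):
    """Fraction of the month above which owning the server is cheaper."""
    return owned_monthly_cost / (hourly_rate * HOURS_PER_MONTH)

# Placeholder numbers, for illustration only.
owned_monthly_cost = 146.0  # amortized hardware, power, space (assumed)
hourly_rate = 0.40          # on-demand price per hour (assumed)

threshold = break_even_utilization(owned_monthly_cost, hourly_rate)
always_on = on_demand_monthly_cost(hourly_rate, 1.0)
```

With these placeholder inputs, an instance that runs around the clock costs more on demand than the owned server, while one that is shut down more than half the time costs less.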
Strategy: buy lots of commodity hardware, because problems tend to be too big for their problem space. Hardware reliability is not that useful either, because it's expensive.
[Showing the same pictures over and over again, someone from Google PR, please authorize the release of newer pictures]
[A GFS description follows, nothing new so far, read the papers on the topic]
[A BigTable description follows, same deal]
I wish this talk had some new information...
Artur works for Wikia. WoWWikia is the 2nd largest wiki around.
Value of performance and reliability is around
WoW: $520 MM of profit per year, 99% reliable but users expect it, so it's really about setting expectations.
Operations is about using resources efficiently and reliably, and has to be measured against revenue from users and the value of downtime (which must be computed). For example, cost per page served is vital to guide decisions.
Example from Wikia: 20% of all wiki pages went from 200 ms to 15 s to load. 35% of pages were slow [per session], and that led to a 15% reduction in "fast" pages viewed, which has a clear cost.
Launched a project with 3 engineers for 4 weeks to improve page performance. It yielded good results, but the ad network is slowing down the whole thing. Since ads use document.write, Wikia overrides it to allow pages to load without waiting for ads to finish loading. This led to more pageviews, but about 20% of ads are not even loaded (network time-out, user clicks away).
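As a sketch of the kind of arithmetic meant here, the snippet below computes cost per page served and the value of an outage. All figures and function names are hypothetical placeholders, not numbers from the talk.

```python
def cost_per_page(monthly_ops_cost, monthly_pageviews):
    """Operations cost attributed to each page served."""
    return monthly_ops_cost / monthly_pageviews

def downtime_cost(revenue_per_hour, hours_down):
    """Revenue lost while the site is unavailable."""
    return revenue_per_hour * hours_down

# Placeholder inputs, for illustration only.
monthly_ops_cost = 50_000.0      # servers, bandwidth, staff (assumed)
monthly_pageviews = 100_000_000  # pages served per month (assumed)
revenue_per_hour = 200.0         # average revenue per hour (assumed)

page_cost = cost_per_page(monthly_ops_cost, monthly_pageviews)
outage_cost = downtime_cost(revenue_per_hour, hours_down=3)
```

Comparing the cost per page with the revenue per page, and the cost of an outage with the cost of preventing it, is what turns these numbers into decisions.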