Looking into system performance of an Oracle data warehouse

Introduction

This is the start of an ongoing investigation into system performance of an oracle 10.2 data warehouse being loaded . The database server has 2 real storage volumes (called dw-clear and dw-encrypt) and 1 virtual one (dw-encrypt-u) used to decrypt data on the fly. Most of the data and the i/o are on the dw-clear volume. System-data performance have been collected via sadc -d to capture per-device statistics. The data are then extracted using sadf -d filename -- -d -b -d. The summary is available here as a csv. It's a large table of block i/o stats, cpu stats and per-device i/o stats, suitable to be imported into R. The system characteristics are as follows.
  • Sun x4150 64GB RAM, 2x4 x5450, 1 4Gb/s QL2462 HBA with 2 ports.
  • 3 device-mapper devices, 2 using a round-robin multipath (v1, v2), 1 using an on-the-fly cipher to decode encrypted data (v3).
  • 3PAR S400 with 10k drives and 4Gb/s HBAs.
  • Out of the 64GB, 8GB are set aside as HugePages to serve as memory pages for the SGA.
The goal of this investigation is to understand what the bottleneck is in the processing and what can be done to remove it. Let's start with cpu utilization. [caption id="attachment_172" align="aligncenter" width="510" caption="Distribution of CPU time spent in userland when not idle"]
Media_httpscaleordief_vwdgl
[/caption] Not terribly loaded (I'm filtering out the long idle portions with user > 5. How about I/O? [caption id="attachment_177" align="aligncenter" width="510" caption="% of CPU spent waiting on IO"]
Media_httpscaleordief_fwbab
[/caption] Interesting, iowait is not negligible. Is it correlated to anything in particular? First of all, let's see how iowait varies with device utilization of v1.
Media_httpscaleordief_ihhfi
v1 is slowly but surely bringing iowait higher, to the point than more than one processor ends up waiting on I/O. To be continued...

Blog battle on the storage appliance front

Backblaze has started an interesting conversation by detailing how they get to $117,000 per PB, down to the type and number of SATA card used in their design. A great PR move for a company in the crowded personal backup space. Of course publishing comparisons with Dell, Sun, NetApp and EMC at 8x, 10x, 30x the price is a sure way to start stirring people's emotions. The first to publish a lengthy response (that StorageMojo could find) is Joerg Moellenkamp in a blog post. Laudable in pointing design flaws for fundamentally 2 different markets. Sure, Sun's hardware is a great piece of engineering, squarely aimed at the enterprise market. Which, incidentally, is not buying in droves and Sun's financials is clearly reflecting that. Backblaze took the google route for storage and it's hard to see, given the competitive pressure, how they would be better off spending their margin on Sun hardware. The era of gold-plated hardware is slowly drawing to a close and I can't say I oppose that change.

Notes from the 2009 Hadoop Summit West

I just got back from Santa Clara where Yahoo and Cloudera were hosting the 2009 Hadoop Summit West on Wednesday followed by a training on Thursday. My interest was one of a prospective user -- to gauge how real and mature hadoop is. The turn-out was more than decent, in the hundreds; a number from Yahoo, running the largest clusters so far, a few folks from Amazon, Facebook, some local universities and a fair number of small companies that have deployed their own clusters (or are running on EC2). The good news first, hadoop is real and it's getting real use. It's clearly a promising platform with active use and development. The scaling model is fairly simple: buy more machines. The current sweet spot is dual-quad hosts with 4x1TB drives and 16GB or so of ECC RAM. Decoupling storage from a central system (à la SAN) is the way to go. Some folks have tried to hook up Thumpers to Niagara chips that run a lot of threads in parallel with some success but the TCO question is unclear. Hence we can start with a handful of cheap machines and go from there. A few things to watch for: the secondary name node for instance, is there here for backup but to persist the DFS layout structures that exist in RAM on the primary name node. It could have been implemented in a more robust fashion using a sql database rather than requiring a re-implementation of redo logs and data files. That's overall the negative point: applications built on the platform (such as hive, hbase and pig) are still pretty much works in progress, somewhat duplication functionality. There is an air of Not Invented Here that still pervades but it's a sign that the whole thing is still young. A vocal user base that meets regularly should help the project focus on the pieces that truly do not exist yet.

A utility function to make a storage decision

I am reproducing an internal twiki post that I put together when we decided to give the Sun Thumper (x4500) a try as an iSCSI target running ZFS.

Introduction

Inspired by a nice paper on utility the following model aims at providing quantitative justifications to choose a SAN solution. This approach relies on the ability to create a "utility" function that takes into account a certain number of factors (explained below) and come up with a mono-dimensional measure (no unit). Factors are:
  • Performance (measured in iops)
  • Capacity (here assumed to be usable, typically raid5 for dev/test, expressed in TB)
  • Availability (aggregate in %)
  • Power (in kWh)
  • Acquisition cost (in $, depreciated over n years)
  • Revenue (in $, if we can tie that to some business figure)
  • Management (in hours per TB)
  • Reliability (fractional and total, in %), a measure of the chances to lose some or all of the datasets

Definitions

Utility = Revenue - Cost(downtime) - Cost(data loss) - Cost(management) - Acquisition Cost(downtime) = (1 - Availability) * SAN size * #developers per TB * hourly developer rate Cost(data loss) = Hourly Failure * SAN size * restore hr per TB * (hourly IT rate * #IT staff per TB +  hourly developer rate * #developers per TB) and Acquisition = capex and opex over the lifetime of the SAN We will assume that revenue is 0 for development and testing. This is of course not true but since any scenario yields the same basic functionality this is a safe assumption to make. We also assume that we have a backup of everything. In case this is impossible, the restore time becomes that of recreating data from scratch. The winner is the solution with the higher utility.

Attempts at using a SunFire x4500 "Thumper" as an iSCSI SAN

For development purposes I was looking for a "cheap" upgrade over our aging Apple XServeRAID fiber channel arrays. These served us well, if one excludes the lack of LUN masking in the later firmware versions, but we have consistently outgrown their native capacity. Besides Apple (and its dismal enterprise support) has stopped selling them so we have been left with no choice but to look elsewhere. The basic requirements are:
  1. Fiber Channel or iSCSI target support to support database workloads
  2. Ability to carve LUNs out of a pool that is large enough (at least 20 TB of raw storage)
  3. Ability to clone volumes
  4. Ability to take snapshots within seconds
There are a few candidates that fullfil this bill: Equallogic, Sun Thumper, 3PAR, Compellent, the XServe replacement from Promise coupled with LVM, various Overland devices. The general hotness of ZFS and the "Try-n-buy" program from Sun made it acceptable to give their hardware and software a try. After all Solaris is not that different from linux (or should I say commute both terms), the hardware is dirt cheap (you can't get much cheaper) and Sun's expertise with hardware systems based on Opterons has been proven in-house on their wonderful SunFire x4600. To make a long story short, the iSCSI target daemon on Solaris 10 is not stable enough for production use. We were plagued with numerous core dumps, causing the iSCSI setup to flickr and initiators to moderately appreciate the frequent interrupts. Setting up OpenSolaris seemed to help a bit but our trust in the stability of the code had suffered an irremediable blow (well, not quite irremediable, but at least for another year or so). Pressed for time I have decided to cut our losses short and to spend more money to get a 48 TB 3PAR E200, vastly more expensive per TB but also known to work. I really wished the Thumper trial had been successful; I believe in OpenSolaris, what I've seen of ZFS makes me green with envy (compared to ext3 + LVM) but I simply cannot justify downtime and/or the hiring of a Solaris core code guru to troubleshoot this mess.

Video on Hadoop from Yahoo!

http://us.dl1.yimg.com/download.yahoo.com/dl/ydn/hadoop.m4v Key points:
  1. cpu, ram size, i/o bandwidth increase exponentially; hard drive seek times do not.
  2. Relational databases and their b-tree datastructures require ln(n) seeks as a crude simplification.
  3. Sort/Merge algorithms working on flat files operate as function of the transfer rates or bandwidth, not seeks (ln(n) is mentioned but I'd think it's at least n.ln(n).
  4. Flat files allow data to not conform to a preconceived schema and is good for exploration
  5. Commodity PCs offer the best computation bang for your bucks but failures are bound to occur frequently; that's the google model of scaling out
Overall this reinforces the idea that MapReduce and its ilk are best-suited for loosely-structured data, or rather data which one does not know how to query and structure beforehand.

On Joyent's recent storage mishaps

I have read with interests comments on Joyent's blog of disgruntled users among which professional system administrators seemed to be found. Beyond the technical merits of the recovery it is quite clear that selling storage services with no backup scheme that would allow data to be available within a matter of hours or a day at best is a dangerous proposition for an internet startup boasting technical prowess. At best it casts a negative light on their understanding of what a backup is and that ultimately I store data remotely not because ZFS or Thumper are cool (they are) but because I hope to be able to retrieve it more elegantly and more simply than going to the bank to retrieve the tape from the vault.

Interestingly enough the brand leader in distributed storage, amazon has an interesting Service Level Agreement that worries solely about the service being available, not that data returned are immune to corruption. I am wondering whether the next step is going to be data insurance, against data loss and format obsolescence. Something to ponder.

In my context neither of these options are appealing since data are the bread and butter of my company, hence the painstaking off-site backup process to mitigate risks and the oh-so-enjoyable-but-needed task of spelling out a comprehensive disaster recovery plan.

Options to build cheap unified storage

We have been running happily with a 3Par S400 in production and it has delivered so far according to our expectations. The ease of management has made the anarchy of our development environments more conspicuous, to the point that we are currently contemplating a few options to get out of the labour-intensive situation.First a bit of context. We are an Oracle shop and the demand to spawn new instances with roughly 1TB of storage has picked up dramatically over the past 6 months. When the company was younger we could simply keep buying Apple XserveRaid and, without LUN masking, manage the mapping between LUNs and hosts manually. We went with Apple because of their prices per GB mostly and the performance is good enough for us, delivering enough IOPS for our needs. Also management was kept to a minimum; again from a cost perspective it made sense. Our Oracle instances use file-based storage management which makes the whole setup relatively easy to grasp.We are now facing storage demands that makes this ad-hoc way of doing things obsolete and border-line dangerous. Besides after interacting with the centralized console of the S400 it is hard to not want to get the same ease of management... for a fraction of the cost. Our options are so far:
  1. Get another 3Par
  2. Get another brand of SAN (e.g. Compellent, XIV, Dell/EMC, etc.) if it turns out to be cheaper
  3. Build scripts to manage existing XServe-RAIDs
  4. Buy a Sun x4500 Thumper and turn it into an iSCSI target.

Option #1 is the most desirable if money were no object. We would then benefit from a uniform storage substrate, maybe get better deals on expansion based on volume purchases, turn on SAN-to-SAN replication to also use that unit as a disaster recovery unit. The downside is, beside cost, vendor lock-in and living at the mercy of proprietary pricing.

Option #2 is mostly interesting for ease of management. Total cost is probably going to be marginally lower and integration with the production infrastructure is less-than-optimal.

For Option #3 if I put my geek hat then it is by far the most appealing option as it promises interesting flexibility at a cost that's hard to beat (a bit over $1 per terabyte). ZFS looks very promising (despite Joyent's recent troubles with the open-source version of Solaris, maybe they should have stuck with the supported Sun version). ZFS would not work for all uses, notably to store database blocks (lest we start using Oracle 11g and its DirectNFS storage). Once I remove my geek hat this proposition is less interesting as we are not a Solaris shop to start with, we have limited experience with iSCSI and the thing has to be production-ready as we would not be able to afford many issues with it.

As for Option #4 it is a variation on #3, albeit at a lower risk. Still the units' limitation on 10.5 TB being split between two discreet controllers means a brittle use of linux's Logical Volume Manager for unix hosts. As long as we are operating storage at the block level (as opposed to a file-level or at a higher-level) this is unlikely to scale.

Which one is it going to be?