A utility function to make a storage decision

I am reproducing an internal twiki post that I put together when we decided to give the Sun Thumper (x4500) a try as an iSCSI target running ZFS.

Introduction

Inspired by a nice paper on utility, the following model aims to provide a quantitative justification for choosing a SAN solution. The approach relies on building a "utility" function that takes a number of factors (explained below) into account and comes up with a single, one-dimensional measure (no unit). The factors are:
  • Performance (measured in IOPS)
  • Capacity (here assumed to be usable, typically RAID 5 for dev/test, expressed in TB)
  • Availability (aggregate in %)
  • Power (in kWh)
  • Acquisition cost (in $, depreciated over n years)
  • Revenue (in $, if we can tie that to some business figure)
  • Management (in hours per TB)
  • Reliability (fractional and total, in %), a measure of the chances to lose some or all of the datasets

Definitions

Utility = Revenue - Cost(downtime) - Cost(data loss) - Cost(management) - Acquisition

where

Cost(downtime) = (1 - Availability) * SAN size * #developers per TB * hourly developer rate

Cost(data loss) = Hourly Failure * SAN size * restore hr per TB * (hourly IT rate * #IT staff per TB + hourly developer rate * #developers per TB)

and

Acquisition = capex and opex over the lifetime of the SAN

We will assume that revenue is 0 for development and testing. This is of course not true, but since every scenario yields the same basic functionality it is a safe assumption to make. We also assume that we have a backup of everything. Where that is impossible, the restore time becomes the time needed to recreate the data from scratch. The winner is the solution with the highest utility.
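The definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool we actually ran: the names (SanOption, Site, utility, best) are made up for the example, Cost(management) is taken as a plain dollar input because no formula is given for it here, and lifetime_hours is an extra assumption that scales the per-hour downtime and data-loss terms to the period being compared (leave it at 1 to get the formulas exactly as written).

```python
from dataclasses import dataclass


@dataclass
class SanOption:
    """One candidate solution. All values are supplied by the caller."""
    name: str
    availability: float      # fraction of time the SAN is up, e.g. 0.999
    hourly_failure: float    # chance of a data-loss event per hour
    san_size_tb: float       # usable capacity in TB
    acquisition: float       # capex + opex over the SAN lifetime, in $
    management_cost: float   # in $; assumption: given directly, no formula
    revenue: float = 0.0     # taken as 0 for development and testing


@dataclass
class Site:
    """Figures that describe the organisation rather than the SAN."""
    developers_per_tb: float
    hourly_developer_rate: float
    it_staff_per_tb: float
    hourly_it_rate: float
    restore_hours_per_tb: float
    lifetime_hours: float = 1.0  # assumption: scales the per-hour costs


def cost_downtime(option: SanOption, site: Site) -> float:
    # (1 - Availability) * SAN size * #developers per TB * hourly developer rate
    return ((1 - option.availability) * option.san_size_tb
            * site.developers_per_tb * site.hourly_developer_rate)


def cost_data_loss(option: SanOption, site: Site) -> float:
    # Hourly Failure * SAN size * restore hr per TB * (IT cost + developer cost)
    people = (site.hourly_it_rate * site.it_staff_per_tb
              + site.hourly_developer_rate * site.developers_per_tb)
    return (option.hourly_failure * option.san_size_tb
            * site.restore_hours_per_tb * people)


def utility(option: SanOption, site: Site) -> float:
    recurring = (cost_downtime(option, site)
                 + cost_data_loss(option, site)) * site.lifetime_hours
    return (option.revenue - recurring
            - option.management_cost - option.acquisition)


def best(options, site: Site) -> SanOption:
    """Return the option with the highest utility."""
    return max(options, key=lambda o: utility(o, site))
```

To compare candidates, build one SanOption per solution with your own measured or quoted figures, describe the organisation once in a Site, and call best(options, site). With revenue at 0 every utility is negative, so the winner is simply the least costly option once downtime and data loss are priced in.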

Options to build cheap unified storage

We have been running happily with a 3Par S400 in production and so far it has delivered according to our expectations. Its ease of management has made the anarchy of our development environments more conspicuous, to the point that we are now contemplating a few options to get out of this labour-intensive situation.

First, a bit of context. We are an Oracle shop, and the demand to spawn new instances with roughly 1 TB of storage each has picked up dramatically over the past 6 months. When the company was younger we could simply keep buying Apple Xserve RAID units and, without LUN masking, manage the mapping between LUNs and hosts manually. We went with Apple mostly because of their price per GB, and the performance is good enough for us, delivering enough IOPS for our needs. Management was also kept to a minimum; again, from a cost perspective it made sense. Our Oracle instances use file-based storage management, which makes the whole setup relatively easy to grasp.

We are now facing storage demands that make this ad-hoc way of doing things obsolete and borderline dangerous. Besides, after interacting with the centralized console of the S400 it is hard not to want the same ease of management... for a fraction of the cost. Our options so far are:
  1. Get another 3Par
  2. Get another brand of SAN (e.g. Compellent, XIV, Dell/EMC, etc.) if it turns out to be cheaper
  3. Buy a Sun x4500 Thumper and turn it into an iSCSI target
  4. Build scripts to manage existing Xserve RAIDs

Option #1 would be the most desirable if money were no object. We would benefit from a uniform storage substrate, maybe get better deals on expansion based on volume purchases, and be able to turn on SAN-to-SAN replication to also use that unit for disaster recovery. The downsides, besides cost, are vendor lock-in and living at the mercy of proprietary pricing.

Option #2 is mostly interesting for ease of management. Total cost is probably going to be only marginally lower, and integration with the production infrastructure would be less than optimal.

As for Option #3, if I put on my geek hat it is by far the most appealing option, as it promises interesting flexibility at a cost that's hard to beat (a bit over $1 per gigabyte). ZFS looks very promising (despite Joyent's recent troubles with the open-source version of Solaris; maybe they should have stuck with the supported Sun version). ZFS would not work for all uses, notably for storing database blocks (unless we start using Oracle 11g and its DirectNFS storage). Once I remove my geek hat the proposition is less interesting: we are not a Solaris shop to start with, we have limited experience with iSCSI, and the thing has to be production-ready, as we could not afford many issues with it.

As for Option #4, it is a variation on #3, albeit at a lower risk. Still, each unit's 10.5 TB is split between two discrete controllers, which means a brittle use of Linux's Logical Volume Manager on Unix hosts. As long as we are operating storage at the block level (as opposed to the file level or higher) this is unlikely to scale.

Which one is it going to be?