"I/O operations per second"
I have been involved in determining the amount of I/O that our storage subsystem needs to support for the entire production infrastructure of the company. When you go and talk to storage vendors the metric that they uniformely use to describe throughput is
iops. Since their systems are typically comprised of one or more controllers and a number of hard-drives, they will quote the number of
iops that a controller can handle. That number is much higher than that of an individual drive so to deliver I/O throughput they have to stripe and mirror data across many drives.
The natural tool on linux to gather i/o stats is, well
which delivers meaningful data when invoked as
iostat -x -t -d 15 to get averages and counts over 15 seconds. I actually slightly modified iostat to output a more meaningful time stamp (note to self, suggest patch). Once I had endless logs of iostat I set up a simple table in postgresql following the schema:
-- postgresql table, simple if not optimal
create table iostat
(
sid serial primary key,
tstamp timestamp not null,
machine varchar(255) not null,
device varchar(16) not null,
rrqms real not null,
wrqms real not null,
rs real not null,
ws real not null,
rsecs real not null,
wsecs real not null,
avgrq_sz real not null,
avgqu_sz real not null,
await real not null,
svctm real not null,
util real not null
);
create index iostat1 on iostat(tstamp);
create index iostat2 on iostat(machine);
create index iostat3 on iostat(device);
No, it's not fully normalized, especially as far as machine and devices are concerned, but that's good enough for my little research. I also put together a simplistic awk script (which for some reason won't show properly in this blog); that scriptt can be invoked as:
gawk -f iostat.awk my_iostat_log | psql my_db
Then it's really a matter of crafting the proper sql query to isolate interesting data and output them in a csv format. That csv file can then be fed to
R to look for correlations. If you really have a lot of stat data, other tools can be used for analysis (in particular if you start having tens of thousands of measurements). In my case R fit the bill perfectly and was able to crunch data fast enough.
Now let me take a concrete example when this contraption was put to use. In a few hours this week I had to compare the performance of two database instances (one Oracle 9i, one Oracle 10g), hosted by the same hardware and reading their data files from the same SAN (an Apple XServe RAID, using the same slice but different files) using a copper-based HBA. I had noticed differences in performance but I did not know whether the discrepancy is to be found in the way the newer version of Oracle interacts with the underlying I/O subsystem, or whether, despite the use of identical hardware, something in the storage layer itself would cause that performance discrepancy. To make things simpler we are focussing on write activity, not read activity. The tests that ran to generate i/o loads were
write-mostly tests (to be precise, they were aimed at exercising large binary object writes).
This is where R comes in handy.
Let's begin by isolating what intuitively are variables -- I'm using the iostat man page to make sense of the various measures:
-
ws: number of write requests issued to the device
-
wrqms: number of merged write requests issued to the device
-
await: average wait for an i/o request to the device, including time spent queued
-
svctm: average service time by the device for all requests
The difference between merged requests and issued requests is that the linux i/o subsystem, by using variations of elevator algorithms tries to minimize the number of actual requests sent to the device, while still achieving some throughput. If it were to send every single requests to the device regardless of the physical location to the disk, we might end up with completely random access, whereas hard drives much prefer sequential access. So i/o requests are grouped and re-ordered when possible.
Hence we should have
ws < wrqms
By the same token, since await is time spent servicing a request by the device and time spent waiting in the i/o queues, we should also have
svctm < await
If we look at the relationship between svctm and await we get the following:
on 9i and:
on 10g.
Notice the obviously linear relationship between both variables. This is a good thing, it means that the SAN is still able to serve requests and not have to queue them up. To be sure if the hard drives behind the SAN take longer on average to service a request (which could be because the average request size grows), in other words if svctm grows, await grows as well but it does not grow much faster.
Although both plots look similar the second one, taken while 10g is running has average waits twice as long, for no obvious reason yet.
Next let's look at the average request size in sectors on 9i (1, on the left) and 10g (2, on the right).
Write requests on 9i are much more consistent in size. Of course even though the high-level test is the same, internally the database engine decides to layout data to disk in a different fashion between the two versions. Not only that but the block device abstraction that the fibre channel driver presents makes the decision to optimize request merges much more difficult to make at the OS-level. To some extent, that is what entreprise-y SAN vendors sell you, the promise that their hardware and software will always perform I/O in the most optimal fashion, behind the scenes. To the host it's all mysterious what happens on the other end of the fibre. Optimizing the fibre channel driver makes in this context a lot of sense.
We can look for other interesting patterns in the data, for instance, does the number of write requests impact service time? It should matter of course as there ought to be a hard limit on the number of requests that the I/O subsystem can take without starting to slow down. Plotting average service time agains number of write requests gave a plot that looked awfully like y = 1/x. Plotting the inverse of the average service time against the number of write requests gave me the surprising but nice plot for both 9i and 10g (superposed here):
I was quite surprised at first, then realized that the size of the write requests was not constant. The size of the request, a hidden variable, can explain why, when there are few requests sent during a time quatum, average service time is high (it's here measured in ms^-1). Per iostat's documentation svctm does not take into account time spent in Linux's I/O queues. But there are other queues here, for instance the buffers used by the fibre channel driver before transmission to the SAN. With available data I can only formulate hypothesis as to why I have not seen 2000 write requests taking on average 1 ms (instead of 0.5 ms). I can attribute that to the fact that the SAN fabric (which comprises the SAN hardware as well as the HBA driver) has limited buffer sizes. Hence the only way for 2000 requests to be processed in that time quantum is for them to be smaller than for the same amount of data to be split across fewer requests. So you can either write a lot of small requests or fewer, bigger requests, but you can't have both in the test set-up described above. The fact that requests vary in size is obviously a problem to further explore this hypothesis.
A few things remain consistent with intuition; for instance the bigger the request on average, the longer it takes to service:
What is mysterious is the difference between 9i in blue and 10g in red. I don't have a good explanation for that.
In the end it turned out that there was a hidden parameter in Oracle 10g that decides of the size of the log buffer and that solved the slow-down problem. So all this investigation turned out to be not needed, but it was fun enough that I did not feel that I wasted my time.