----------------------------------------------------------------------------
The Florida SunFlash

                 Disk Striping and High Performance I/O

SunFLASH Vol 43 #7                                                 July 1992
----------------------------------------------------------------------------
Printed with the permission of Dave Taber.  -johnj
----------------------------------------------------------------------------

Revision 1.0                                     Copyright 1992 Dave Taber

Abstract: This paper discusses basic I/O principles for UNIX systems and
compares software-based striping with hardware RAID controllers. No claims
are made regarding the accuracy of information in this paper: it is
intended to be interesting and illustrative rather than instructive.


Introduction

One of the distinguishing features of modern science is its certainty,
quantifiability, and repeatability. Engineering is typically free of
mystery and mysticism. A few areas of engineering remain "black arts,"
however: acoustics, LF antenna design, and high performance UNIX I/O.
These areas are so full of complicating factors and uncertainty that there
is comparatively little in the way of standard engineering practice. These
fields are characterized by dubious measurements, designs that are very
tightly bound to hardware idiosyncrasies, and variations so wide that they
give engineering practice the appearance of a folk art.

With this in mind, this paper provides an overview of high-performance
UNIX I/O. While it is intended for use with Sun hardware and the SunOS
operating system, many of the observations apply to RISC/UNIX systems
throughout the industry today.


I/O in the Good Old Days

When Things Were Simple (in the days of the Cray-1 and KL-10), CPUs were
slow, all disks rotated at 3600 rpm or less, and controllers were stupid.
Because controllers were really stupid--with little caching, and not much
more functionality than a typical gate array--file systems and I/O
intensive applications had to pull every trick in the book to move data
quickly. Many of the minis of this era had specialized I/O architectures
that allowed data to move quickly from core to disk with little
intervention by the CPU.

Once the I/O hardware (and related memory systems) got to be fast enough,
the next bottleneck was quickly exposed: the disk transfer rate (typically
~1 MB/s sequential...remember, this is before 1980). Striping was invented
to allow several simultaneous I/O transfers, multiplying the I/O rate
available to an application by 3 times or more. A single I/O stream was
multiplexed across 3 or more disks, slicing the file into "stripes" that
were written across the drives.

Striping was particularly effective in these systems because:

* Controllers had small caches (often 8 kB or less), and their caching
  algorithms were very simple (single-stream FIFO).

* The CPU was relatively more powerful than the controller, and the
  particular applications could spare a few CPU cycles in order to speed
  up I/O. Further, the CPU "cost" of a typical I/O was low, and code paths
  were much shorter than today.

* The I/Os were large and contiguous; in the scientific community that
  first fell in love with striping, files could be 100 MB or more. Random
  I/O was much less likely than today (remember, NFS hadn't been invented
  and RDBMSs were little more than theoretical exercises).

* The I/Os were highly predictable, even deterministic. The systems using
  striping were usually dedicated to a single "I/O-hot" application; the
  processes were run on "real-time" systems with fixed I/O priority, or
  run in batch at very high priority.
* Due to the vintage, the systems typically had relatively unsophisticated
  virtual-memory software; I/O was not complicated by the highly adaptive
  and non-deterministic memory management behaviors of modern UNIX.


SunOS I/O

Life became more complicated with the popularization of UNIX. Advanced
systems such as SunOS developed unified memory management systems that
required all disk traffic to compete for RAM space and memory bandwidth.
SunOS optimized memory utilization for CPU responsiveness - a very
valuable feature for a workstation - while creating a reasonable
compromise for I/O bandwidth. The process scheduler made similar
tradeoffs, promoting the priority of all processes that wanted disk
access. But there was no way to truly prioritize one I/O-hot process over
another, and there was little that could be done to prevent contention
among the various users of disk bandwidth (e.g., paging).

The multi-process and virtual memory environment that made SunOS so
attractive also made I/O performance vulnerable to circumstantial factors.
I/O performance was less deterministic, and it could be changed by many
factors in the instantaneous loading of the CPU, RAM, and I/O channels. As
a workstation or server typically has a dozen daemons or other sneaky
consumers of I/O and related system resources, it has become more
difficult to predict the bandwidth or latency of I/O in operational use.

In SunOS 4.1.1, a "clustering" version of UFS and the SCSI driver were
added to the system. These innovations allowed file system performance to
exceed 80% of the underlying I/O system speed. The clustering of data --
grouping of up to seven 8 kB blocks into a single cluster -- allowed up to
a track of data to be moved from RAM to the disk. In lab benchmarks, a
SPARCstation 2 with SCSI disks was shown to be able to do amazing things
with large sequential transfers. In some customer situations, however, I/O
speed improvements were less noticeable.

While the kernel threading and the "real time" features of SunOS 5.0 can
help reduce the uncertainty of I/O performance, these features are
unlikely to counterbalance the enormous growth in code-path lengths for
I/O. For all the benefits of generality and multiple levels of
indirection, I/O goes through a very long and complicated path, with new
potential bottlenecks and interruption points. To counterbalance this, the
CPU has gotten much faster, the busses wider and better arbitrated, and
controllers have become full-fledged computers in their own right. IPI
controllers typically have the processing power of a Sun 3/60, and with
1 MB of cache they do some very clever and inscrutable optimizations.

The net effect has been that the OS, drivers, and controller programs have
become incredibly complex -- fast, multithreaded... and, unfortunately,
very difficult to predict and analyze for I/O performance. You don't
believe me? Just ask one of the four or five experts who can explain
DiskSuite striping performance. But I'm getting ahead of the story.


Basic Striping

As explained above, striping consists of multiplexing a single large I/O
stream across several drives. This process can be thought of as
"de-clustering"--breaking an I/O stream apart on writing, and
re-assembling it on reading. The first step in striping is to have several
physical devices treated as a single set of blocks or addresses.
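
To make the "single set of blocks or addresses" idea concrete, here is a
minimal user-space sketch of the basic stripe mapping. It is illustrative
only: the names, the three-disk stripe, and the 32-sector interleave are
assumptions for the example, not DiskSuite internals. A logical sector on
the combined device is mapped, one interleave's worth at a time,
round-robin across the member drives.

    /*
     * Illustrative sketch of basic stripe address mapping (not DiskSuite
     * source).  A logical sector number on the combined device is mapped
     * onto one of NDISKS member drives, INTERLEAVE sectors at a time.
     */
    #include <stdio.h>

    #define NDISKS      3            /* member drives in the stripe       */
    #define INTERLEAVE  32           /* stripe unit, in 512-byte sectors  */

    struct phys_addr {
        int  disk;                   /* which member drive                */
        long sector;                 /* sector offset on that drive       */
    };

    static struct phys_addr map_sector(long logical)
    {
        struct phys_addr p;
        long unit   = logical / INTERLEAVE;   /* which stripe unit        */
        long offset = logical % INTERLEAVE;   /* offset within that unit  */

        p.disk   = (int)(unit % NDISKS);      /* round-robin across disks */
        p.sector = (unit / NDISKS) * INTERLEAVE + offset;
        return p;
    }

    int main(void)
    {
        long s;

        for (s = 0; s < 160; s += 32) {
            struct phys_addr p = map_sector(s);
            printf("logical sector %3ld -> disk %d, sector %ld\n",
                   s, p.disk, p.sector);
        }
        return 0;
    }

Because consecutive stripe units land on different drives, a large
transfer can keep several spindles busy at once--which is the whole point.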
The OS or driver sends an I/O request to the striping entity (whether
hardware or software), and then the striping algorithm dispatches as many
I/Os as possible, as rapidly as possible, to the underlying devices.

Striping involves some interesting side effects. Although there are
several independent disk arms and platters, they are all grouped into a
single logical unit. This means that all the arms move as one (and it's
best for the platters to all be synchronized as well). Random accesses are
not only _not improved_, they are actually more time-consuming than they
would be if all the disks were independent. The standard striping
algorithm is of little use with random-intensive activities such as NFS or
relational databases.

For striping to work well, the heads have to be on track nearly all the
time, and the various steps in the "I/O chain" (CPU-SRAM-DRAM-VM-driver-
channel adaptor-disk controller-disk) have to be basically available and
unconstrained "all the time" (where all == any time I use or measure I/O
throughput).

We have therefore come to my first principal point: striping is not
particularly happy in the UNIX world. There are too many things going on
in the system, competing for system resources, for striping to help much
of the time. There are multiple I/O-intensive processes, and most of those
are using "random" access patterns a lot of the time. The net effect is
that competing I/O activities keep the I/O path busy and move the heads
off track. Striping on a real-world UNIX system will not live up to the
expectations of customers who remember the days when I/O was Simple.

The idea of the standard striping algorithm is to give (or ask for) fairly
large amounts of data to a drive (or controller) before moving on to feed
the next drive (or controller). Since the drive I/O rates are much slower
than the bus or controller internal rates, this makes sense: you can keep
the underlying device queues full of data or requests. Obviously, keeping
these queues full requires very careful _timing_ and synchronization. As
each disk (usually) is rotating independently, and each controller (or
data path) is running asynchronously, high performance requires a very
responsive striping entity. If the striping entity is out to lunch
servicing interrupts or running a long servo calculation, performance can
become very erratic.

While hardware-intensive approaches can reduce latencies and
unpredictability, hardware striping is not cheap. Typically a very special
controller is required, and solid array controllers require at least
100 K lines of code. The inexpensive array controllers are inadequate; the
adequate array controllers are economical only when configured with 20 GB
or more of disk.

Sun chose a software-intensive approach because...well, I'm not going to
tell you. It's too good a story, and I figure it's worth at least a beer,
and I can probably get you to buy me dinner if you really care that much.
Anyway, it turns out to have been the right thing to do from a Marketing
point of view, even though we cannot achieve the performance of a
hardware-intensive system. Instead, we get flexibility and low cost. You
can use any disks you want, any size partitions you want, configured
according to your wishes.


Pseudo-Device Striping

Pioneered by Pyramid in 1988, UNIX pseudo-drivers are the technically
elegant way to do device-level striping. Using a pseudo-driver loaded into
the kernel, several hardware devices can be assembled together into a
single logical device.
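
Conceptually, each request that arrives at such a pseudo-driver is broken
into per-member sub-requests, which are then handed to the underlying
device drivers and can proceed in parallel. The user-space sketch below
shows only that splitting step; the function name, the three-member
stripe, and the printf standing in for "issue the sub-request" are
assumptions for illustration, not Online: DiskSuite code.

    /*
     * Illustrative sketch: split one logical transfer into per-drive
     * sub-requests, the way a striping pseudo-driver conceptually does.
     * A real driver hands each piece to the underlying device driver
     * instead of calling printf.
     */
    #include <stdio.h>

    #define NDISKS        3
    #define INTERLEAVE_B  (32 * 512L)          /* stripe unit, in bytes  */

    static void split_request(long offset, long length)
    {
        while (length > 0) {
            long unit   = offset / INTERLEAVE_B;   /* which stripe unit   */
            long within = offset % INTERLEAVE_B;   /* offset inside it    */
            long run    = INTERLEAVE_B - within;   /* bytes left in unit  */

            if (run > length)
                run = length;

            printf("disk %ld: offset %ld, %ld bytes\n",
                   unit % NDISKS,
                   (unit / NDISKS) * INTERLEAVE_B + within,
                   run);

            offset += run;
            length -= run;
        }
    }

    int main(void)
    {
        split_request(0, 56 * 1024L);     /* one 56 kB cluster's worth */
        return 0;
    }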
The striping algorithm is coded into the driver and controlled by a set of
utilities and configuration files. Online: DiskSuite is the
second-generation pseudo-device driver available from SunSoft. It
implements a simple RAID-0 stripe, with no option for parity blocks or
other redundancy mechanisms. Several other firms have done similar work as
Catalyst vendors, with varying degrees of success and robustness...but all
the products use the same basic principles.

Pseudo-device striping interposes itself between the VM system (the Source
of All I/O) and other drivers, providing yet another level of indirection
(YALI) in the system. Even though all the work is done in kernel space,
the I/O code path gets a little bit longer. Doing striping in the kernel
creates some incremental work for the CPU, and creates some additional
points for bottlenecks, or at least indeterminacy.

In the course of moving from your application to the disk platter and back
again, the data goes through several "aliases." It starts out as
user-space data (let's say, a file). It turns into a number of UFS blocks
(8 kB each), can be grouped into a cluster (of up to 56 kB), is
transformed into a number of VM pages (which are scattered pseudorandomly
throughout RAM), is then sent down as a metadisk page, then a driver page,
then broken into 16 or more disk blocks (512 bytes each), and finally
moves over the I/O bus as a series of blocks or packets. Very little has
actually changed about the data, but it is reaggregated and called
different things along the way. Generally speaking, the documentation and
discussion of DiskSuite data is talking about raw or file system blocks--
which are 8 kB each. Keeping the terminology straight makes the
documentation easier to understand.


RAID Hardware Implementations

Doing RAID in hardware offloads the CPU, and allows tighter coupling of
the striping algorithm to the underlying disks. This allows for tighter
synchronization and higher performance than is possible with a pure
software implementation. Many RAID array controllers can achieve 10 MB/s
or more using striping, and a few can achieve 20 MB/s. Note that most Sun
systems cannot absorb data this fast, so there is little point in paying
extra for RAID speeds beyond 10 MB/s sustained. Most RAID controllers have
an option (or a future prospect) for RAID-5 parity, which incrementally
improves random write performance and provides a measure of redundancy
(the parity arithmetic itself is sketched at the end of this section).

Almost invariably, RAID array controllers have been under-designed and
oversold. Consequently, they have not sold well (total installed base is
estimated at less than 100,000 units worldwide). RAID is NOT good for high
availability unless it's incredibly well implemented. Most RAID boxes have
OK performance, but questionable reliability. The RAID advantage of
cheaper-than-mirroring redundancy is dubious for the following reasons:

* RAID has a lot of electronics, which introduce more failure points.

* The RAID controller and power supply are new single points of failure.

* RAID is cheaper than mirroring ONLY in large configurations... it's
  actually quite expensive in small ones.

* Once you've committed to a RAID box, you are "closing" the open system.
  RAID boxes are proprietary internally... and once you've put the data in
  there, you can't transparently get it back out again. Said another way,
  with mirroring you can selectively mirror and unmirror data at will,
  without any conversion; with RAID, you have to format one great big disk
  and restore the file system in the RAID format.
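
For reference, the redundancy behind RAID-4/RAID-5 style parity is nothing
more exotic than a byte-wise XOR across the data blocks of a stripe: lose
any one block and it can be rebuilt from the surviving blocks plus the
parity block. The toy sketch below uses invented block contents purely for
illustration; it is not any controller's firmware.

    /*
     * Toy sketch of XOR parity: compute a parity block over a stripe,
     * then rebuild a "lost" data block from the survivors plus parity.
     */
    #include <stdio.h>
    #include <string.h>

    #define NDATA  4              /* data blocks per parity group   */
    #define BLKSZ  8              /* tiny blocks, just for the demo */

    int main(void)
    {
        unsigned char data[NDATA][BLKSZ] = {
            "block-0", "block-1", "block-2", "block-3"
        };
        unsigned char parity[BLKSZ]  = { 0 };
        unsigned char rebuilt[BLKSZ] = { 0 };
        int b, i;

        /* parity block = XOR of all data blocks in the stripe */
        for (b = 0; b < NDATA; b++)
            for (i = 0; i < BLKSZ; i++)
                parity[i] ^= data[b][i];

        /* pretend block 2 is lost; rebuild it from survivors + parity */
        memcpy(rebuilt, parity, BLKSZ);
        for (b = 0; b < NDATA; b++)
            if (b != 2)
                for (i = 0; i < BLKSZ; i++)
                    rebuilt[i] ^= data[b][i];

        printf("rebuilt block 2: %s\n", rebuilt);   /* prints "block-2" */
        return 0;
    }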
Benchmarking

"The nice thing about standards is that there are so many to choose from"
                                                                -- Anonymous

The same quotation could apply to benchmarks, particularly for I/O. While
the SPEC committee has made substantial progress in the formulation and
standardization of benchmarks for CPU and hardware performance, the SPEC
benchmarks have generally not measured (or even extensively exercised) I/O
hardware or software subsystems. This is OK for most workstation use, but
the closest things we have to meaningful I/O benchmarks are
nfsstones/LADDIS and TPC-A, -B, or -C. These benchmarks measure all kinds
of things besides pure I/O, however, and they are fairly application
specific (meaning that they are difficult to generalize or extrapolate).

Within Sun, a benchmark called IObench is used to characterize I/O
subsystem performance. IObench "sprays" the I/O system with as much I/O of
a given type as possible, in order to measure random and sequential raw
and file system I/O. IObench appears to be easy to set up and use, but
there is significant debate about the meaning of its results. I know of no
corresponding benchmark outside of Sun whose results are comparable with
IObench.

For measuring Real True Meaningful I/O performance, a pure I/O load is
probably not the best idea. In real-world applications, there are several
applications and system activities competing with the I/O-intensive
process for system resources. Running a defined, repeatable load of
applications in a tightly controlled hardware and software configuration
would probably produce the most meaningful benchmark data for striping and
other high-performance I/O subsystems. Unfortunately, producing these
benchmarks would take a significant amount of people and system time, and
they could not be easily compared across customers (due to peculiarities
in hardware configurations and application loads). Until The World agrees
on an extensive and standardized set of I/O benchmarks, this situation
isn't going to change significantly...so I guess we all have to deal with
it.

General guidelines for benchmarking:

* have repeatable, documented conditions
* have reasonable configurations and parameters
* have a realistic, representative workload
* control the configuration, and vary only one parameter at a time.


Online: DiskSuite striping--expectations and tuning

OK, this is the section where the rubber meets the road. A reasonable
expectation for DiskSuite striping is:

    If   you set everything up just right,
    AND  there aren't any contention or bottleneck problems,
    AND  you have large, contiguous files and little random access,
    THEN you can expect that a three-way stripe MIGHT give you two disks'
         worth of performance. MAYBE. It's really hard to be sure.

How do you satisfy the above conditions? Well, here are some hints.
Obviously, you want to have the stripes set up for maximum throughput. But
this varies substantially depending on I/O load, number of users, and
prevailing winds in the VM system. The rule of thumb for sequential access
(which seems to be what the customer is trying to optimize for) is "have
your stripe interleave between 50% and 200% of your typical 'long I/O'".
So, setting the interleave very low (less than 16 b) is probably
suboptimal (remember that "b" in this case is the disk sector, 512 bytes).
Setting it between 16 kB and 56 kB (32 and 127 b, in DiskSuite terms) is
usually the right answer. Your mileage may vary.
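
To keep the units straight, here is a tiny sketch that converts between
kilobytes and DiskSuite's 512-byte sector ("b") unit and applies the
50%-200% rule of thumb. The helper name and the assumed 32 kB "typical
long I/O" are illustrative guesses, not product parameters.

    /*
     * Illustrative unit conversion for interleave sizing (assumptions,
     * not product defaults): DiskSuite expresses the interleave in
     * 512-byte sectors ("b"), while transfers are usually discussed
     * in kilobytes.
     */
    #include <stdio.h>

    #define SECTOR_SIZE 512L

    static long kb_to_sectors(long kb)
    {
        return (kb * 1024L) / SECTOR_SIZE;
    }

    int main(void)
    {
        long typical_io_kb = 32;    /* assumed "typical long I/O" size */

        printf("16 kB interleave = %ld sectors\n", kb_to_sectors(16));
        printf("rule of thumb    = %ld to %ld sectors\n",
               kb_to_sectors(typical_io_kb) / 2,    /*  50% of long I/O */
               kb_to_sectors(typical_io_kb) * 2);   /* 200% of long I/O */
        return 0;
    }

With that assumed 32 kB transfer, the rule of thumb works out to roughly
32 to 128 sectors, which is the same neighborhood as the range recommended
above.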
Also make SURE that you have set up your file system the right way.
Remember we told you about the file system clustering implemented in SunOS
4.1.1? There's a feature in DiskSuite that by default turns the clustering
OFF. If you were previously running with it on, you'll be sorely
disappointed. It's very simple to turn the clustering back ON again; it
takes about 20 seconds once you locate the RTF for the product or the
manual. Use tunefs to set maxcontig to 7, and rotdelay to 0 (for most
modern disks).

But what about general system setup? It turns out the best things for
striping are the best things for I/O in general. You want to have as many
independent spindles as possible. Don't stripe across spindles that have
partitions in use by someone else. Have as many separate I/O paths as
possible--one for each disk involved is best. For SCSI, that means
multiple host adaptors. For SMD and IPI, that means multiple controllers.
Try to have the system disk on as lightly loaded an I/O channel as
possible, so you aren't waiting around for paging. Try to have as much RAM
in the system as you can afford. For best performance, make sure that all
the disks in the stripe are the same basic type and speed (e.g., avoid
mixing 3 MB/s and 6 MB/s IPI, or sync and async SCSI drives). Use
iostat -D 2 to evaluate your I/O load, making sure that it's well
balanced.

Things that DON'T matter:

* size of disk
* type of interface
* brand of disk (well, sort of. Avoid real schlock, second-quality
  product)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For information send mail to info-sunflash@sunvice.East.Sun.COM.
Subscription requests should be sent to
sunflash-request@sunvice.East.Sun.COM. Archives are on solar.nova.edu,
paris.cs.miami.edu, uunet.uu.net, src.doc.ic.ac.uk and ftp.adelaide.edu.au

All prices, availability, and other statements relating to Sun or third
party products are valid in the U.S. only. Please contact your local Sales
Representative for details of pricing and product availability in your
region. Descriptions of, or references to, products or publications within
SunFlash do not imply an endorsement of that product or publication by Sun
Microsystems.

John McLaughlin, SunFlash editor, flash@sunvice.East.Sun.COM.
(305) 776-7770.