----------------------------------------------------------------------------
The Florida SunFlash

                 Disk Striping and High Performance I/O

SunFLASH Vol 43 #7                                                 July 1992
----------------------------------------------------------------------------
Printed with the permission of Dave Taber.  -johnj
----------------------------------------------------------------------------

Revision 1.0                                     Copyright 1992 Dave Taber

Abstract: This paper discusses basic I/O principles for UNIX systems and
compares software-based striping with hardware RAID controllers. No claims
are made regarding the accuracy of information in this paper: it is
intended to be interesting and illustrative rather than instructive.


Introduction

One of the distinguishing features of modern science is its certainty,
quantifiability, and repeatability. Engineering is typically free of
mystery and mysticism. A few areas of engineering remain "black arts,"
however: acoustics, LF antenna design, and high performance UNIX I/O.
These areas are so full of complicating factors and uncertainty that there
is comparatively little in the way of standard engineering practice. These
fields are characterized by dubious measurements, designs that are very
tightly bound to hardware idiosyncrasies, and variations so wide that they
give engineering practice the appearance of a folk art.

With this in mind, this paper provides an overview of high-performance
UNIX I/O. While it is intended for use with Sun hardware and the SunOS
operating system, many of the observations apply to RISC/UNIX systems
throughout the industry today.


I/O in the Good Old Days

When Things Were Simple (in the days of the Cray-1 and KL-10), CPUs were
slow, all disks rotated at 3600 rpm or less, and controllers were stupid.
Because controllers were really stupid--with little caching, and not much
more functionality than a typical gate array--file systems and I/O
intensive applications had to pull every trick in the book to move data
quickly. Many of the minis of this era had specialized I/O architectures
that allowed data to move quickly from core to disk with little
intervention by the CPU.

Once the I/O hardware (and related memory systems) got to be fast enough,
the next bottleneck was quickly exposed: the disk transfer rate (typically
~1 MB/s sequential...remember, this is before 1980). Striping was invented
to allow several simultaneous I/O transfers, multiplying the I/O rate
available to an application by 3 times or more. A single I/O stream was
multiplexed across 3 or more disks, slicing the file into "stripes" that
were written across the drives.

Striping was particularly effective in these systems because:

* Controllers had small caches (often 8 kB or less), and their caching
  algorithms were very simple (single-stream FIFO).

* The CPU was relatively more powerful than the controller, and the
  particular applications could spare a few CPU cycles in order to speed
  up I/O. Further, the CPU "cost" of a typical I/O was low, and code paths
  were much shorter than today.

* The I/Os were large and contiguous; in the scientific community that
  first fell in love with striping, files could be 100 MB or more. Random
  I/O was much less likely than today (remember, NFS hadn't been invented
  and RDBMSs were little more than theoretical exercises).

* The I/Os were highly predictable, even deterministic. The systems using
  striping were usually dedicated to a single "I/O-hot" application; the
  processes were run on "real-time" systems with fixed I/O priority, or
  run in batch at very high priority.
* Due to the vintage, the systems typically had relatively unsophisticated
  virtual-memory software; I/O was not complicated by the highly adaptive
  and non-deterministic memory management behaviors of modern UNIX.


SunOS I/O

Life became more complicated with the popularization of UNIX. Advanced
systems such as SunOS developed unified memory management systems that
required all disk traffic to compete for RAM space and memory bandwidth.
SunOS optimized memory utilization for CPU responsiveness - a very
valuable feature for a workstation - while creating a reasonable
compromise for I/O bandwidth. The process scheduler made similar
tradeoffs, promoting the priority of all processes that wanted disk
access. But there was no way to truly prioritize one I/O-hot process over
another, and there was little that could be done to prevent contention
among the various users of disk bandwidth (e.g., paging).

The multi-process and virtual memory environment that made SunOS so
attractive also made I/O performance vulnerable to circumstantial factors.
I/O performance was less deterministic, and it could be changed by many
factors in the instantaneous loading of the CPU, RAM, and I/O channels. As
a workstation or server typically has a dozen daemons or other sneaky
consumers of I/O and related system resources, it has become more
difficult to predict the bandwidth or latency of I/O in operational use.

In SunOS 4.1.1, a "clustering" version of UFS and the SCSI driver were
added to the system. These innovations allowed file system performance to
exceed 80% of the underlying I/O system speed. The clustering of data --
grouping of up to seven 8 kB blocks into a single cluster -- allowed up to
a track of data to be moved from RAM to the disk. In lab benchmarks, a
SPARCstation 2 with SCSI disks was shown to be able to do amazing things
with large sequential transfers. In some customer situations, however, I/O
speed improvements were less noticeable.

While the kernel threading and the "real time" features of SunOS 5.0 can
help reduce the uncertainty of I/O performance, these features are
unlikely to counterbalance the enormous growth in code-path lengths for
I/O. For all the benefits of generality and multiple levels of
indirection, I/O goes through a very long and complicated path, with new
potential bottlenecks and interruption points. To counterbalance this, the
CPU has gotten much faster, the busses wider and better arbitrated, and
controllers have become full-fledged computers in their own right. IPI
controllers typically have the processing power of a Sun 3/60, and with
1 MB of cache they do some very clever and inscrutable optimizations.

The net effect has been that the OS, drivers, and controller programs have
become incredibly complex -- fast, multithreaded... and, unfortunately,
very difficult to predict and analyze for I/O performance. You don't
believe me? Just ask one of the four or five experts who can explain
DiskSuite striping performance. But I'm getting ahead of the story.


Basic Striping

As explained above, striping consists of multiplexing a single large I/O
stream across several drives. This process can be thought of as
"de-clustering"--breaking an I/O stream apart on writing, and
re-assembling it on reading. The first step in striping is to have several
physical devices treated as a single set of blocks or addresses.
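
To make the "single set of blocks or addresses" idea concrete, here is a
minimal user-space sketch of the basic stripe mapping. It is illustrative
only: the names, the three-disk stripe, and the 32-sector interleave are
assumptions for the example, not DiskSuite internals. A logical sector on
the combined device is mapped, one interleave's worth at a time,
round-robin across the member drives.

    /*
     * Illustrative sketch of basic stripe address mapping (not DiskSuite
     * source).  A logical sector number on the combined device is mapped
     * onto one of NDISKS member drives, INTERLEAVE sectors at a time.
     */
    #include <stdio.h>

    #define NDISKS      3            /* member drives in the stripe       */
    #define INTERLEAVE  32           /* stripe unit, in 512-byte sectors  */

    struct phys_addr {
        int  disk;                   /* which member drive                */
        long sector;                 /* sector offset on that drive       */
    };

    static struct phys_addr map_sector(long logical)
    {
        struct phys_addr p;
        long unit   = logical / INTERLEAVE;   /* which stripe unit        */
        long offset = logical % INTERLEAVE;   /* offset within that unit  */

        p.disk   = (int)(unit % NDISKS);      /* round-robin across disks */
        p.sector = (unit / NDISKS) * INTERLEAVE + offset;
        return p;
    }

    int main(void)
    {
        long s;

        for (s = 0; s < 160; s += 32) {
            struct phys_addr p = map_sector(s);
            printf("logical sector %3ld -> disk %d, sector %ld\n",
                   s, p.disk, p.sector);
        }
        return 0;
    }

Because consecutive stripe units land on different drives, a large
transfer can keep several spindles busy at once--which is the whole point.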
The OS or driver sends an I/O request to the striping entity (whether
hardware or software), and then the striping algorithm dispatches as many
I/Os as possible, as rapidly as possible, to the underlying devices.

Striping involves some interesting side effects. Although there are
several independent disk arms and platters, they are all grouped into a
single logical unit. This means that all the arms move as one (and it's
best for the platters to all be synchronized as well). Random accesses are
not only _not improved_, they are actually more time-consuming than they
would be if all the disks were independent. The standard striping
algorithm is of little use with random-intensive activities such as NFS or
relational databases.

For striping to work well, the heads have to be on track nearly all the
time, and the various steps in the "I/O chain" (CPU-SRAM-DRAM-VM-driver-
channel adaptor-disk controller-disk) have to be basically available and
unconstrained "all the time" (where all == any time I use or measure I/O
throughput).

We have therefore come to my first principal point: striping is not
particularly happy in the UNIX world. There are too many things going on
in the system, competing for system resources, for striping to help much
of the time. There are multiple I/O-intensive processes, and most of those
are using "random" access patterns a lot of the time. The net effect is
that competing I/O activities keep the I/O path busy and move the heads
off track. Striping on a real-world UNIX system will not live up to the
expectations of customers who remember the days when I/O was Simple.

The idea of the standard striping algorithm is to give (or ask for) fairly
large amounts of data to a drive (or controller) before moving on to feed
the next drive (or controller). Since the drive I/O rates are much slower
than the bus or controller internal rates, this makes sense: you can keep
the underlying device queues full of data or requests. Obviously, keeping
these queues full requires very careful _timing_ and synchronization. As
each disk (usually) is rotating independently, and each controller (or
data path) is running asynchronously, high performance requires a very
responsive striping entity. If the striping entity is out to lunch
servicing interrupts or running a long servo calculation, performance can
become very erratic.

While hardware-intensive approaches can reduce latencies and
unpredictability, hardware striping is not cheap. Typically a very special
controller is required, and solid array controllers require at least
100 K lines of code. The inexpensive array controllers are inadequate; the
adequate array controllers are economical only when configured with 20 GB
or more of disk.

Sun chose a software-intensive approach because...well, I'm not going to
tell you. It's too good a story, and I figure it's worth at least a beer,
and I can probably get you to buy me dinner if you really care that much.
Anyway, it turns out to have been the right thing to do from a Marketing
point of view, even though we cannot achieve the performance of a
hardware-intensive system. Instead, we get flexibility and low cost. You
can use any disks you want, any size partitions you want, configured
according to your wishes.


Pseudo-Device Striping

Pioneered by Pyramid in 1988, UNIX pseudo-drivers are the technically
elegant way to do device-level striping. Using a pseudo-driver loaded into
the kernel, several hardware devices can be assembled together into a
single logical device.
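
Conceptually, each request that arrives at such a pseudo-driver is broken
into per-member sub-requests, which are then handed to the underlying
device drivers and can proceed in parallel. The user-space sketch below
shows only that splitting step; the function name, the three-member
stripe, and the printf standing in for "issue the sub-request" are
assumptions for illustration, not Online: DiskSuite code.

    /*
     * Illustrative sketch: split one logical transfer into per-drive
     * sub-requests, the way a striping pseudo-driver conceptually does.
     * A real driver hands each piece to the underlying device driver
     * instead of calling printf.
     */
    #include <stdio.h>

    #define NDISKS        3
    #define INTERLEAVE_B  (32 * 512L)          /* stripe unit, in bytes  */

    static void split_request(long offset, long length)
    {
        while (length > 0) {
            long unit   = offset / INTERLEAVE_B;   /* which stripe unit   */
            long within = offset % INTERLEAVE_B;   /* offset inside it    */
            long run    = INTERLEAVE_B - within;   /* bytes left in unit  */

            if (run > length)
                run = length;

            printf("disk %ld: offset %ld, %ld bytes\n",
                   unit % NDISKS,
                   (unit / NDISKS) * INTERLEAVE_B + within,
                   run);

            offset += run;
            length -= run;
        }
    }

    int main(void)
    {
        split_request(0, 56 * 1024L);     /* one 56 kB cluster's worth */
        return 0;
    }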
The striping algorithm is coded into the driver and controlled by a set of
utilities and configuration files. Online: DiskSuite is the
second-generation pseudo-device driver available from SunSoft. It
implements a simple RAID-0 stripe, with no option for parity blocks or
other redundancy mechanisms. Several other firms have done similar work as
Catalyst vendors, with varying degrees of success and robustness...but all
the products use the same basic principles.

Pseudo-device striping interposes itself between the VM system (the Source
of All I/O) and other drivers, providing yet another level of indirection
(YALI) in the system. Even though all the work is done in kernel space,
the I/O code path gets a little bit longer. Doing striping in the kernel
creates some incremental work for the CPU, and creates some additional
points for bottlenecks, or at least indeterminacy.

In the course of moving from your application to the disk platter and back
again, the data goes through several "aliases." It starts out as
user-space data (let's say, a file). It turns into a number of UFS blocks
(8 kB each), can be grouped into a cluster (of up to 56 kB), is
transformed into a number of VM pages (which are scattered pseudorandomly
throughout RAM), is then sent down as a metadisk page, then a driver page,
then broken into 16 or more disk blocks (512 bytes each), and finally
moves over the I/O bus as a series of blocks or packets. Very little has
actually changed about the data, but it is reaggregated and called
different things along the way. Generally speaking, the documentation and
discussion of DiskSuite data is talking about raw or file system blocks--
which are 8 kB each. Keeping the terminology straight makes the
documentation easier to understand.


RAID Hardware Implementations

Doing RAID in hardware offloads the CPU, and allows tighter coupling of
the striping algorithm to the underlying disks. This allows for tighter
synchronization and higher performance than is possible with a pure
software implementation. Many RAID array controllers can achieve 10 MB/s
or more using striping, and a few can achieve 20 MB/s. Note that most Sun
systems cannot absorb data this fast, so there is little point in paying
extra for RAID speeds beyond 10 MB/s sustained. Most RAID controllers have
an option (or a future prospect) for RAID-5 parity, which incrementally
improves random write performance and provides a measure of redundancy
(the parity arithmetic itself is sketched at the end of this section).

Almost invariably, RAID array controllers have been under-designed and
oversold. Consequently, they have not sold well (total installed base is
estimated at less than 100,000 units worldwide). RAID is NOT good for high
availability unless it's incredibly well implemented. Most RAID boxes have
OK performance, but questionable reliability. The RAID advantage of
cheaper-than-mirroring redundancy is dubious for the following reasons:

* RAID has a lot of electronics, which introduce more failure points.

* The RAID controller and power supply are new single points of failure.

* RAID is cheaper than mirroring ONLY in large configurations... it's
  actually quite expensive in small ones.

* Once you've committed to a RAID box, you are "closing" the open system.
  RAID boxes are proprietary internally... and once you've put the data in
  there, you can't transparently get it back out again. Said another way,
  with mirroring you can selectively mirror and unmirror data at will,
  without any conversion; with RAID, you have to format one great big disk
  and restore the file system in the RAID format.
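
For reference, the redundancy behind RAID-4/RAID-5 style parity is nothing
more exotic than a byte-wise XOR across the data blocks of a stripe: lose
any one block and it can be rebuilt from the surviving blocks plus the
parity block. The toy sketch below uses invented block contents purely for
illustration; it is not any controller's firmware.

    /*
     * Toy sketch of XOR parity: compute a parity block over a stripe,
     * then rebuild a "lost" data block from the survivors plus parity.
     */
    #include <stdio.h>
    #include <string.h>

    #define NDATA  4              /* data blocks per parity group   */
    #define BLKSZ  8              /* tiny blocks, just for the demo */

    int main(void)
    {
        unsigned char data[NDATA][BLKSZ] = {
            "block-0", "block-1", "block-2", "block-3"
        };
        unsigned char parity[BLKSZ]  = { 0 };
        unsigned char rebuilt[BLKSZ] = { 0 };
        int b, i;

        /* parity block = XOR of all data blocks in the stripe */
        for (b = 0; b < NDATA; b++)
            for (i = 0; i < BLKSZ; i++)
                parity[i] ^= data[b][i];

        /* pretend block 2 is lost; rebuild it from survivors + parity */
        memcpy(rebuilt, parity, BLKSZ);
        for (b = 0; b < NDATA; b++)
            if (b != 2)
                for (i = 0; i < BLKSZ; i++)
                    rebuilt[i] ^= data[b][i];

        printf("rebuilt block 2: %s\n", rebuilt);   /* prints "block-2" */
        return 0;
    }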
Benchmarking

"The nice thing about standards is that there are so many to choose from"
                                                                -- Anonymous

The same quotation could apply to benchmarks, particularly for I/O. While
the SPEC committee has made substantial progress in the formulation and
standardization of benchmarks for CPU and hardware performance, the SPEC
benchmarks have generally not measured (or even extensively exercised) I/O
hardware or software subsystems. This is OK for most workstation use, but
the closest things we have to meaningful I/O benchmarks are
nfsstones/LADDIS and TPC-A, -B, or -C. These benchmarks measure all kinds
of things besides pure I/O, however, and they are fairly application
specific (meaning that they are difficult to generalize or extrapolate).

Within Sun, a benchmark called IObench is used to characterize I/O
subsystem performance. IObench "sprays" the I/O system with as much I/O of
a given type as possible, in order to measure random and sequential raw
and file system I/O. IObench appears to be easy to set up and use, but
there is significant debate about the meaning of its results. I know of no
corresponding benchmark outside of Sun whose results are comparable with
IObench.

For measuring Real True Meaningful I/O performance, a pure I/O load is
probably not the best idea. In real-world applications, there are several
applications and system activities competing with the I/O-intensive
process for system resources. Running a defined, repeatable load of
applications in a tightly controlled hardware and software configuration
would probably produce the most meaningful benchmark data for striping and
other high-performance I/O subsystems. Unfortunately, producing these
benchmarks would take a significant amount of people and system time, and
they could not be easily compared across customers (due to peculiarities
in hardware configurations and application loads). Until The World agrees
on an extensive and standardized set of I/O benchmarks, this situation
isn't going to change significantly...so I guess we all have to deal with
it.

General guidelines for benchmarking:

* have repeatable, documented conditions
* have reasonable configurations and parameters
* have a realistic, representative workload
* control the configuration, and vary only one parameter at a time.


Online: DiskSuite striping--expectations and tuning

OK, this is the section where the rubber meets the road. A reasonable
expectation for DiskSuite striping is:

    If   you set everything up just right,
    AND  there aren't any contention or bottleneck problems,
    AND  you have large, contiguous files and little random access,
    THEN you can expect that a three-way stripe MIGHT give you two disks'
         worth of performance. MAYBE. It's really hard to be sure.

How do you satisfy the above conditions? Well, here are some hints.
Obviously, you want to have the stripes set up for maximum throughput. But
this varies substantially depending on I/O load, number of users, and
prevailing winds in the VM system. The rule of thumb for sequential access
(which seems to be what the customer is trying to optimize for) is "have
your stripe interleave between 50% and 200% of your typical 'long I/O'".
So, setting the interleave very low (less than 16 b) is probably
suboptimal (remember that "b" in this case is the disk sector, 512 bytes).
Setting it between 16 kB and 56 kB (32 and 127 b, in DiskSuite terms) is
usually the right answer. Your mileage may vary.
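
To keep the units straight, here is a tiny sketch that converts between
kilobytes and DiskSuite's 512-byte sector ("b") unit and applies the
50%-200% rule of thumb. The helper name and the assumed 32 kB "typical
long I/O" are illustrative guesses, not product parameters.

    /*
     * Illustrative unit conversion for interleave sizing (assumptions,
     * not product defaults): DiskSuite expresses the interleave in
     * 512-byte sectors ("b"), while transfers are usually discussed
     * in kilobytes.
     */
    #include <stdio.h>

    #define SECTOR_SIZE 512L

    static long kb_to_sectors(long kb)
    {
        return (kb * 1024L) / SECTOR_SIZE;
    }

    int main(void)
    {
        long typical_io_kb = 32;    /* assumed "typical long I/O" size */

        printf("16 kB interleave = %ld sectors\n", kb_to_sectors(16));
        printf("rule of thumb    = %ld to %ld sectors\n",
               kb_to_sectors(typical_io_kb) / 2,    /*  50% of long I/O */
               kb_to_sectors(typical_io_kb) * 2);   /* 200% of long I/O */
        return 0;
    }

With that assumed 32 kB transfer, the rule of thumb works out to roughly
32 to 128 sectors, which is the same neighborhood as the range recommended
above.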
Also make SURE that you have set up your file system the right way.
Remember we told you about the file system clustering implemented in SunOS
4.1.1? There's a feature in DiskSuite that by default turns the clustering
OFF. If you were previously running with it on, you'll be sorely
disappointed. It's very simple to turn the clustering back ON again; it
takes about 20 seconds once you locate the RTF for the product or the
manual. Use tunefs to set maxcontig to 7, and rotdelay to 0 (for most
modern disks).

But what about general system setup? It turns out the best things for
striping are the best things for I/O in general. You want to have as many
independent spindles as possible. Don't stripe across spindles that have
partitions in use by someone else. Have as many separate I/O paths as
possible--one for each disk involved is best. For SCSI, that means
multiple host adaptors. For SMD and IPI, that means multiple controllers.
Try to have the system disk on as lightly loaded an I/O channel as
possible, so you aren't waiting around for paging. Try to have as much RAM
in the system as you can afford. For best performance, make sure that all
the disks in the stripe are the same basic type and speed (e.g., avoid
mixing 3 MB/s and 6 MB/s IPI, or sync and async SCSI drives). Use
iostat -D 2 to evaluate your I/O load, making sure that it's well
balanced.

Things that DON'T matter:

* size of disk
* type of interface
* brand of disk (well, sort of. Avoid real schlock, second-quality
  product)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For information send mail to info-sunflash@sunvice.East.Sun.COM.
Subscription requests should be sent to
sunflash-request@sunvice.East.Sun.COM. Archives are on solar.nova.edu,
paris.cs.miami.edu, uunet.uu.net, src.doc.ic.ac.uk and ftp.adelaide.edu.au

All prices, availability, and other statements relating to Sun or third
party products are valid in the U.S. only. Please contact your local Sales
Representative for details of pricing and product availability in your
region. Descriptions of, or references to, products or publications within
SunFlash do not imply an endorsement of that product or publication by Sun
Microsystems.

John McLaughlin, SunFlash editor, flash@sunvice.East.Sun.COM.
(305) 776-7770.