----------------------------------------------------------------------------
Sun Database Excelerator
SunFLASH Vol 19 #1                                              July 1990
----------------------------------------------------------------------------

The Sun Database Excelerator (SunDBE) is an unbundled software product: an alternative kernel to SunOS 4.1 for Sun customers who run database applications. It speeds up those applications by optimizing the virtual memory algorithms, which lets SunDBE support very large shared memory partitions and large numbers of concurrent users with less overhead than SunOS 4.1. In addition, it provides faster asynchronous I/O for raw disk devices when individual transfers are smaller than 63 Kbytes, and it provides 1024 file descriptors per process rather than 256. SunDBE is in beta test now, with general availability expected in late September.

SunDBE was designed specifically to optimize the performance of the Oracle, Sybase, Ingres, and Informix DBMS products on Sun SPARCservers, but its performance enhancements can also benefit a wide variety of other applications, such as those with large virtual memory requirements. SunDBE provides full SunOS kernel functionality. The SunDBE release consists of a pre-built kernel and the object files needed to build custom kernels.

SunDBE is intended for use only on the following Sun-4 server-class machines, each with at least two disks and 16MB of memory:

    Sun-4/260
    Sun-4/280
    SPARCserver 1, 1+
    SPARCserver 330
    SPARCserver 370
    SPARCserver 390
    SPARCserver 470
    SPARCserver 490

SunDBE also works with the SPARCengine 1, SPARCengine 1+, and SPARCengine 3xx board products, since these are the CPU boards used in the SPARCserver 1, SPARCserver 1+, and SPARCserver 3xx respectively, and they run SunOS without changes. It does not work with the SPARCengine 1E boards, because they require a special version of SunOS.

SunDBE is not a substitute for an entire SunOS installation. SunOS 4.1 must be installed before SunDBE is installed.
SunDBE is supported only on SunOS 4.1. It will not work with earlier versions of SunOS (e.g., SunOS 4.0.3).

VM Enhancements

The virtual memory changes in SunDBE are intended to improve the performance of the MMU (Memory Management Unit) when many processes compete for it, possibly with large shared memory caches and large text segments. The changes can also help whenever virtual memory requirements are large, regardless of the number of competing processes.

The existing sun4 and sun4c machines use two- or three-level MMUs. Each level of the MMU is a direct map that translates its input address to an output address; taken as a whole, the MMU translates virtual addresses to physical addresses. In SunOS (and in UNIX in general), each running process has its own address space. The objects in memory that are mapped into address spaces are pages of text segments (instructions), data and stack pages, shared libraries, and shared memory pages.

The last map in the address translation process is the Page Map. Page Map Entries (PMEs) are organized in groups of 32 (64 on sun4c) called Page Map Entry Groups (PMEGs). Each PMEG therefore maps a 256KB segment of some virtual address space (32 x 8KB pages on sun4, or 64 x 4KB pages on sun4c). Grouping PMEs into PMEGs allows sparse address spaces to be handled more efficiently.

To make translation fast, PMEGs are stored in fast memory. Fast memory is expensive, however, so it is not practical to provide enough of it to hold all the PMEGs needed for the largest possible virtual memory usage. The number of PMEGs for each server type is:

    Server                    PMEGs
    SPARCserver 1               128
    Sun-4/260, Sun-4/280        512
    SPARCserver 3XX             256
    SPARCserver 4XX            1024

As the aggregate virtual memory used by all processes increases, the number of PMEGs in use also increases. Eventually, when VM requirements become too large, the system "runs out" of PMEGs.
For example, on a SPARCserver 1 this happens at 32MB (128 x 256KB), and on SPARCserver 4XX machines at 256MB (1024 x 256KB).

Running out of PMEGs is analogous to running out of physical memory. A process that references a virtual address in a range that has no PMEG traps to the OS, which must then "resolve the fault" by finding a free PMEG or "stealing" one from another process. A large enough process can even steal from itself. Because the kernel does not have perfect knowledge of every application, its choice of which PMEGs to steal is not always optimal. This can lead to "PMEG thrashing," just as insufficient physical memory can lead to page thrashing.

A PMEG steal is expensive: for each page that is linked in the corresponding segment, many internal OS data structures, in several parts of the VM system, must be updated to reflect that the translation is now invalid. Moreover, when the process that lost the PMEG next runs and needs the segment, it traps to the OS for each page of its working set that belongs to that segment, and the OS again updates its structures to mark the translation valid. In addition, Sun systems use virtually addressed caches, which the OS must partially flush whenever translations change.

SunDBE's improved virtual memory layer uses a flush optimization technique that minimizes the number of flushes required; most PMEG steals do not flush at all. Stolen translations are copied to software (SW) page table entries so that they remain readily available. On a later page fault, the array of cached SW entries is copied into the hardware page map group to set up the new translation. SunDBE thus improves performance by caching translations efficiently, reducing or eliminating the costly re-translation that otherwise follows a PMEG steal.
Asynchronous I/O Enhancements

Multi-threaded DBMS server products let a single server process serve many DBMS clients. This architecture requires asynchronous I/O; otherwise each server process blocks on every I/O until it completes. The fast asynchronous I/O in the Sun Database Excelerator is faster than the asynchronous I/O provided in SunOS 4.1.

The asynchronous I/O changes in SunDBE are intended to reduce the CPU overhead of asynchronous disk I/O operations on raw partitions where each I/O buffer is 63 Kbytes or smaller. The freed CPU cycles can then go to the "real" work done by the DBMS, increasing performance and supporting more users.

SunDBE asynchronous I/O takes a short path through the kernel and calls the disk drivers directly to perform each raw disk I/O. Each I/O therefore needs fewer CPU cycles than the generic asynchronous I/O in SunOS 4.1, which has:

    1) a longer code path through the file system layer of the kernel, and
    2) the added overhead of kernel thread management.

File Descriptor Enhancements

Multi-threaded DBMS server products consume large numbers of file descriptors for user connections, disk partitions, and network connections. SunOS 4.1 supports a maximum of 256 open file descriptors per process. This places a hard limit on the number of concurrent users on a system running a Sybase or Ingres DBMS product, and it affects large servers where customers want to handle hundreds of concurrent users.

This Sun Database Excelerator feature raises the per-process file descriptor limit to 1024. The higher limit is available only to DBMS products that explicitly request it; all other programs running on a Sun Database Excelerator kernel see the normal SunOS 4.1 per-process limit.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
SunFlash is an electronic mail news service from Sun Microsystems, Ft. Lauderdale, Florida, USA. It is targeted at Sun users and customers. As a field sales and support office, we try to keep SunFlash useful and interesting to you. If you have any comments or suggestions for enhancing SunFlash, please send them to us.

SunFlash is distributed via a hierarchy of aliases. Please address change requests to the owner of the alias you belong to. Please address comments to the SunFlash editor, John McLaughlin (sun!sunvice!flash or flash@sunvice.East.Sun.COM), (305) 776-7770.