NetWare 386 Speed Rating

Glenn Westin
Consultant
Systems Engineering Division

Abstract: This AppNote provides an explanation of the NetWare 386 speed rating test and the system factors involved in attaining a high rating. These factors include the CPU chip type, the clock speed, and the main and memory cache architecture.

Disclaimer

Novell, Inc. makes no representations or warranties with respect to the contents or use of these Application Notes, or any of the third party products discussed in the AppNotes. Novell reserves the right to revise these Application Notes and to make changes in their contents at any time, without obligation to notify any person or entity of such revisions or changes. These AppNotes do not constitute an endorsement of the third party product or products that were tested. The configuration or configurations tested or described may or may not be the only available solution. Any test is not a determination of product quality or correctness, nor does it ensure compliance with any federal, state, or local requirements. Novell does not warrant products except as stated in applicable Novell product warranties or license agreements.

Copyright © 1990 by Novell, Inc., Provo, Utah. All rights reserved.

As a means of promoting NetWare Application Notes, Novell grants you without charge the right to reproduce, distribute and use copies of the AppNotes provided you do not receive any payment, commercial benefit or other consideration for the reproduction or distribution, or change any copyright notices appearing on or in the document.

Introduction

NetWare 386 is a high performance network operating system written specifically for the 80386 microprocessor. The operating system takes advantage of the 80386's 32-bit architecture, advanced instruction set and memory management features.
NetWare 386 will run only on an 80386 or 80486 CPU, unlike its less advanced predecessor, NetWare 2.1x, which will run on an 80286 microprocessor as well as a 386 processor.

Speed Rating

While initializing, the OS performs a system speed test. The speed rating produced by this test serves two purposes. First, the test informs the system administrator of the file server's current operating speed. (Some systems possess an AUTO CPU mode or have selectable CPU speeds that start in low speed. In low speed, some computers run as slow as 8 or even 6 MHz.) Second, the test provides a way to rank file server performance with respect to CPU types, clock speeds, memory and cache.

On completion of the test, a rating is displayed at the server console. A higher rating indicates a faster system. For example, an 80386SX CPU running at 16 MHz should get a rating of about 95, while an 80386 CPU running at 16 MHz should get a rating of about 120 (see Table I). Ratings over 600 can be obtained with a properly configured 80486 server.

Table I: Microprocessor Chip Ratings

What Makes it Tick

When reduced to an elementary level, the file server speed test is a simple loop that runs for approximately 0.16 seconds. The main function of the test is to count the number of iterations that can be completed within less than 2/10ths of a second. A larger number of iterations indicates a faster machine. Because computer instructions are timed in nanoseconds (there are one billion nanoseconds to a second), any interference with the loop's operation can greatly alter the final results.

Before the speed test can begin, the floppy drive is reset and then shut down. This is done because some computers automatically switch to slow speed when the floppy drive is accessed, so that copy protection schemes can be read properly. Unlike DOS, NetWare 386 does not reset the speed through the use of a real mode timer.
Therefore, the system would be permanently set in the slow mode, which would have a profoundly negative effect on performance.

Once the floppy is shut down, the speed test loop can begin. At the end of each iteration, the speed counter is incremented and the system time is checked. The process of checking the time and incrementing the speed counter requires several instructions and moves from CPU registers to memory. After three ticks (approximately 0.16 seconds) the loop is exited. The result is then divided by 1000, stored in memory (for later retrieval by typing "speed" at the server console) and displayed at the file server console. The result is divided by 1000 because the number in the counter is usually in the five to six digit range, which tends to be quite unwieldy. The use of these CPU instructions, registers and memory locations allows the routine to test various computers while maintaining a consistent testing component throughout. (See Fig. 1.)

Chip Type

Through the use of various CPU instructions, the system's chip type, clock cycles, memory wait states and cache are exercised. The chip type is important because of its inherent capabilities for data movement and the speed at which it can process instructions.

The issue of data movement involves comparison of the SX to the 386 and 486 chips. The SX is a 32-bit chip, but it performs its data movements to and from memory in 16-bit chunks, while the 80386 and 80486 chips move their data 32 bits at a time. Because the SX chip must talk to its memory 16 bits at a time, it works harder and takes longer to perform the same tasks than the 80386 and 80486 chips do.

A second main factor is the chip's ability to process instructions quickly, which depends on the chip's architecture. As micro technology has improved, manufacturers have been able to fit more transistors, and consequently more functionality, onto microprocessor chips. This is evident in the chip evolution listed in Table II.
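The speed test loop described under What Makes it Tick can be approximated in a few lines. The following is only an illustrative sketch, not the NetWare 386 routine: the names speed_rating and make_fake_clock are hypothetical, the tick length of about 1/18.2 second is an assumption chosen so that three ticks come to roughly 0.16 seconds, and integer division by 1000 is assumed for the final scaling.

```python
import time

# Assumption: one timer tick is about 1/18.2 s, so three ticks is roughly
# 0.16 s as the note states. The note does not give the tick length itself.
TICK_SECONDS = 1 / 18.2
TEST_TICKS = 3


def speed_rating(clock=time.perf_counter, ticks=TEST_TICKS,
                 tick_seconds=TICK_SECONDS):
    """Count loop iterations until `ticks` ticks have passed; return count // 1000."""
    deadline = clock() + ticks * tick_seconds
    counter = 0
    while True:
        counter += 1              # increment the speed counter
        if clock() >= deadline:   # check the system time
            break
    return counter // 1000        # scale the raw count down to a short rating


def make_fake_clock(step):
    """Hypothetical deterministic clock that advances `step` seconds per call."""
    now = 0.0

    def clock():
        nonlocal now
        now += step
        return now

    return clock


if __name__ == "__main__":
    # Real run: the value depends entirely on the machine and interpreter.
    print("rating:", speed_rating())
```

Passing make_fake_clock with a smaller step simulates a machine that completes each iteration in less time, and the returned rating rises accordingly. Numbers produced this way are not comparable to the ratings in Table I.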
Table II: Chip Evolution

Figure 1: Speed test flow chart

The Intel 80486 combines four formerly separate components into one chip: the 80386 architecture, an 8K data cache, a cache controller, and an 80387DX-compatible math coprocessor. Intel also added Burst Mode high-speed data transfer and a five-stage pipeline feature, which processes up to five program instructions at once.

The internal data cache of the 486 is far superior to the external cache used with the 386. When coupled with burst-mode data transfer, 128 bits of data are transferred into the internal 8K cache from main memory (or an external cache) with each CPU request. As a result, the 80486 uses fewer clock cycles to move data than does the 80386. The 486 can receive four 32-bit data blocks in only five to six clock cycles. The most efficient 80386 chip transfers one 32-bit block every two cycles. The 486 can handle multiple instructions at different stages of completion, which further adds to its ability to complete operations at its maximum rate of one transfer per clock cycle. These features allow the 486 to leap far beyond the processing capabilities of its ancestors, a feat that can be benchmarked through the performance speed rating in NetWare 386.

Clock Speed

The clock speed of most personal computers is determined by very precise vibrations of a thin slice of quartz crystal. This crystal may be in a metal package by itself on the CPU board, or it may be combined with other circuits into an oscillator module. In either case, the crystal and oscillator frequency is twice the speed at which the microprocessor operates. The chip cuts the clock speed in half internally before using it. In other words, an 80386 that operates at 16 MHz requires a system clock that operates at 32 MHz.

Clock speed is measured in MHz: millions of cycles (or pulses) per second. Therefore, a computer's clock counts time in nanoseconds or billionths of a second.
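The relationship between clock speed and cycle time is simple arithmetic, shown here as a small sketch. The function names cycle_time_ns and oscillator_mhz are hypothetical and exist only for illustration; the doubling rule is the one stated above for the crystal and oscillator.

```python
def cycle_time_ns(clock_mhz):
    """Length of one clock cycle in nanoseconds (1000 ns divided by MHz)."""
    return 1000.0 / clock_mhz


def oscillator_mhz(cpu_mhz):
    """Crystal/oscillator frequency, which the note gives as twice the CPU speed."""
    return 2 * cpu_mhz


if __name__ == "__main__":
    for mhz in (8, 16):
        print(mhz, "MHz CPU:", cycle_time_ns(mhz), "ns per cycle,",
              oscillator_mhz(mhz), "MHz oscillator")
```

At 16 MHz each cycle lasts 62.5 nanoseconds, and at the 8 MHz low speed mentioned earlier each cycle lasts 125 nanoseconds.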
The throughput of a computer (how much information it can actually process) is directly related to its clock speed. Hence, a higher clock speed, coupled with the superior architecture found in the 386 and 486 chips, increases the number of instructions that can be performed in a given amount of time.

Memory

One of the primary functions of a microprocessor is the movement of data to and from memory. The speed at which the system's memory chips operate can affect the time that these transfers require. By manipulating data between memory locations and registers (as the speed rating test does), the system's memory architecture becomes a factor in determining system speed.

Ideal Memory System

The ideal memory system is one in which the rate at which memory can supply information to the processor matches the rate at which the processor can execute code. If memory is slower than the processor, the system is said to be bus bound. If the processor is slower than memory, the system is processor bound. Making one (processor or memory) faster than the other will probably cost more, but will not improve performance.

Clock Cycles

If memory cannot respond fast enough to meet the processor's demand for data, the processor must wait one or more clock cycles. The processor is not necessarily inactive during these clock cycles, because it has separate bus and execution units. While the bus unit is retrieving information from memory, the execution unit can perform other operations, such as register manipulation.

Each clock cycle that the processor has to wait is called a wait state. Memory fast enough to respond to the CPU in two clock cycles is said to operate at zero wait states. (A memory access normally requires a minimum of two clock cycles: one for the microprocessor to send out an instruction informing the memory system which bytes it wants to read, and a second for the memory system to provide the contents of the location.) Each additional wait state increases the memory access by one clock cycle.
Therefore, a three-cycle memory access corresponds to one wait state (that is, two cycles equal zero wait states, three cycles equal one wait state, and so on). Some 80386-based computers may, at times, have to endure 2 or 3 wait states per memory cycle when accessing system board memory, and sometimes 16 or more wait states when they read from memory expansion boards.

Wait states are necessary because most PC designs use Dynamic Random Access Memory (DRAM) chips, which are significantly slower and less expensive than static RAM. For this reason, PC designers also implemented memory cache.

Memory Cache

There are two forms of cache used to improve system performance: memory cache and disk cache. Memory cache increases RAM performance, while disk cache improves disk efficiency. The NetWare speed rating test exercises the memory cache of the file server. (While disk caching can play an important role in a LAN environment, it deviates from the main purpose of this AppNote, and will be discussed in detail in future AppNotes.)

Memory cache decreases the time required to fetch the next instruction in the program code. A memory cache architecture combines fast static random access memory (SRAM) speed with cost effective DRAMs. It provides a small amount (usually 32K, but this figure can be larger) of fast SRAM (the cache) that is logically located between the processor and main memory (which is usually simple DRAM). SRAM in the cache usually has an access time of 35 nsec to as low as 15 nsec, compared to the 80 to 120 nsec access times of the DRAM used in main memory. This increased access speed allows swift CPUs to access data in the cache at zero wait states. Cache circuitry ensures that the portions of main memory that are most often used are copied into the cache, so that the majority of memory accesses go to fast memory in the cache and not to the slower main memory.
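The wait state arithmetic above, combined with a cache that serves most accesses at zero wait states, can be put into a rough model. This is a simplified sketch under stated assumptions: every access served from the cache is assumed to cost exactly two cycles, every other access is assumed to cost two cycles plus a fixed number of wait states, and the function names are hypothetical. The 0.96 figure is the hit rate this AppNote quotes for a 32K cache.

```python
ZERO_WAIT_CYCLES = 2  # per the note: a zero-wait-state access takes two clock cycles


def access_cycles(wait_states):
    """Clock cycles for one memory access with the given number of wait states."""
    return ZERO_WAIT_CYCLES + wait_states


def wait_states_for(cycles):
    """Wait states implied by a memory access that takes `cycles` clock cycles."""
    if cycles < ZERO_WAIT_CYCLES:
        raise ValueError("a memory access needs at least two clock cycles")
    return cycles - ZERO_WAIT_CYCLES


def average_access_cycles(cache_fraction, miss_wait_states):
    """Average cycles per access when `cache_fraction` of accesses hit the cache.

    Assumption: cache accesses run at zero wait states and all other accesses
    pay `miss_wait_states` wait states. Real systems vary per access.
    """
    fast = access_cycles(0)
    slow = access_cycles(miss_wait_states)
    return cache_fraction * fast + (1.0 - cache_fraction) * slow


if __name__ == "__main__":
    print("three-cycle access:", wait_states_for(3), "wait state")
    print("no cache, 3 wait states:", average_access_cycles(0.0, 3), "cycles")
    print("0.96 from cache, 3 wait states:", average_access_cycles(0.96, 3), "cycles")
```

Under these assumptions, a system that would otherwise spend five cycles on every access averages a little over two cycles per access once most accesses are served from the cache.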
Whenever the processor attempts to read a memory location in a system that uses a memory cache, the memory subsystem checks to see if the contents of that location are stored in cache. If so, the data is transferred from the cache at fast SRAM speed (referred to as a cache hit). If the data is not in cache, the processor must wait until the data can be transferred from slower main memory. This is called a cache miss and, depending on the speed of the main memory DRAM and the speed of the CPU, may inject many wait states into the system. During the wait states, the contents of the location are also copied into cache, where they can be accessed more quickly the next time they are needed.

A cache is made effective by the tendency of most computer programs to access the same few memory locations over and over, and to access locations neighboring those accessed recently. Once those few locations have been loaded into cache, most accesses are made from the cache, not from slower main memory. This increases system performance. NetWare 386, like other well-developed programs, is designed with these features, and benefits from this technology.

While more cache RAM always helps, it can reach a point of diminishing returns. Tests show that a 32K cache will achieve a 96 percent hit rate, and that doubling or tripling the cache size only improves the hit rate by one or two percentage points. In most cases the increased cost for the extra cache memory is not worth such a marginal performance increase.

Conclusion

There are many interrelated factors that go into making a high-speed file server. The CPU chip type, the clock speed, and the main and memory cache architecture are the most vital of these factors. The NetWare 386 Processor Speed test exercises all these components to formulate its results. The information gleaned from such a test can be of use for selecting the appropriate system and for evaluating a system's daily operation.
But this information is only one element in the overall performance or throughput of a file server. There are other major components that must be examined with the same scrutiny. For example, the disk and communications channels should each be tested because, in either case, an improper selection can cause even the fastest speed-rated computers to function with less than optimum performance.