NetWare 386 Speed Rating

 Glenn Westin
 Consultant
 Systems Engineering Division

Abstract: This AppNote provides an explanation of the NetWare 386 speed
rating test and the system factors involved in attaining a high rating.
These factors include the CPU chip type, the clock speed, and the main
memory and memory cache architecture.

Disclaimer

Novell, Inc. makes no representations or warranties with respect to the
contents or use of these Application Notes, or any of the third party
products discussed in the AppNotes. Novell reserves the right to revise
these Application Notes and to make changes in their contents at any
time, without obligation to notify any person or entity of such revisions
or changes. These AppNotes do not constitute an endorsement of the third
party product or products that were tested. The configuration or
configurations tested or described may or may not be the only available
solution. Any test is not a determination of product quality or
correctness, nor does it ensure compliance with any federal, state, or
local requirements. Novell does not warrant products except as stated in
applicable Novell product warranties or license agreements.

Copyright © 1990 by Novell, Inc., Provo, Utah. All rights reserved.

As a means of promoting NetWare Application Notes, Novell grants you
without charge the right to reproduce, distribute and use copies of the
AppNotes provided you do not receive any payment, commercial benefit or
other consideration for the reproduction or distribution, or change any
copyright notices appearing on or in the document.

Introduction

NetWare 386 is a high performance network operating system written
specifically for the 80386 microprocessor. The operating system takes
advantage of the 80386's 32-bit architecture, advanced instruction set
and memory management features. NetWare 386 will only run on an 80386 or
80486 CPU, unlike its less advanced predecessor, NetWare 2.1x, which will
run on an 80286 microprocessor as well as a 386 processor.

Speed Rating

While initializing, the OS performs a system speed test. The speed rating
produced by this test serves two purposes. First, the test informs the
system administrator of the file server's current operating speed. (Some
systems possess an AUTO CPU mode or have selectable CPU speeds that start
in low speed. In low speed, some computers run as slow as 8 or even 6
MHz.) Second, the test provides a way to rank file server performance
with respect to CPU types, clock speeds, memory and cache.

On completion of the test a rating is displayed at the server console. A
higher rating indicates a faster system. For example, an 80386SX CPU
running at 16 MHz should get a rating of about 95, while an 80386 CPU
running at 16 MHz should get a rating of about 120 (see Table I). 
Ratings over 600 can be obtained with a properly configured 80486 server.

Table I: Microprocessor Chip Ratings

What Makes it Tick

When reduced to an elementary level, the file server speed test is a
simple loop that runs for approximately 0.16 seconds. The main function
of the test is to count the number of iterations that can be completed
within less than 2/10ths of a second. A larger number of iterations
indicates a faster machine. Because computer instructions are timed in
nanoseconds (there are one billion nanoseconds to a second) any
interference with the loop's operation can greatly alter the final
results.

Before the speed test can begin, the floppy drive is reset and then shut
down. This is done because some computers automatically switch to slow
speed when the floppy drive is accessed, which allows copy protection
schemes to be properly read. Unlike DOS, NetWare 386 does not reset the
speed through the use of a real mode timer. Therefore, the system would
otherwise be permanently set in slow mode, which would have a profoundly
negative effect. Once the floppy is shut down, the speed test loop can
begin. At the end of each iteration, the speed counter is incremented and
the system time is checked.

The process of checking the time and incrementing the speed counter
requires several instructions and moves from CPU registers to memory.
After three ticks (approximately 0.16 seconds) the loop is exited. The
result is then divided by 1000, stored in memory (for later retrieval
by typing "speed" at the server console) and displayed at the file server
console. The result is divided by 1000 because the number in the counter
is usually in the five to six digit range, which tends to be quite
unwieldy. The use of these CPU instructions, registers and memory
locations allows the routine to test various computers while maintaining
a consistent testing component throughout. (See Fig. 1.)
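
The loop described above can be sketched in a few lines. This is only an
illustration of the counting logic, not the actual NetWare 386 routine:
the names speed_rating, read_tick and fake_timer are hypothetical, and
the timer is simulated so that the example is repeatable.

```python
import itertools

TICKS_PER_TEST = 3      # the loop is exited after three timer ticks
RATING_DIVISOR = 1000   # the raw counter is divided by 1000 for display


def speed_rating(read_tick, ticks=TICKS_PER_TEST):
    """Count loop iterations until `ticks` timer ticks have elapsed.

    read_tick is a hypothetical callable that returns the current tick
    count; on a real server this role is played by the system timer.
    """
    start = read_tick()
    counter = 0
    # Each pass checks the time and, if the test is not over, increments
    # the speed counter.
    while read_tick() - start < ticks:
        counter += 1
    return counter // RATING_DIVISOR


def fake_timer(reads_per_tick):
    """Simulated timer (assumption): advances one tick every N reads."""
    reads = itertools.count()
    return lambda: next(reads) // reads_per_tick


slow_machine = speed_rating(fake_timer(40_000))
fast_machine = speed_rating(fake_timer(200_000))
```

In this simulation a machine that completes five times as many
iterations per tick receives roughly five times the rating, which is the
relationship the test relies on.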

Chip Type

Through the use of various CPU instructions, the system's chip type,
clock cycles, memory wait states and cache are exercised. The chip type
is important because of its inherent capabilities for data movement and
the speed at which it can process instructions. The issue of data
movement involves comparison of the SX to the 386 and 486 chips. The SX
is a 32-bit chip, but it performs its data movements to and from memory
in 16-bit chunks, while the 80386 and 80486 chips move their data 32 bits
at a time. Because the SX chip must talk to its memory 16 bits at a time,
it works harder and takes longer to perform the same tasks as the other
80386 and 486 chips.
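
The effect of the narrower data path can be shown with simple
arithmetic. The sketch below assumes aligned data and counts only bus
transfers; the function name bus_transfers and the 1024-byte block size
are chosen for illustration.

```python
import math


def bus_transfers(num_bytes, bus_width_bits):
    """Bus transfers needed to move num_bytes over a data path of the
    given width, assuming the data is aligned to the bus width."""
    bytes_per_transfer = bus_width_bits // 8
    return math.ceil(num_bytes / bytes_per_transfer)


# Moving the same 1024-byte block over each data path.
sx_transfers = bus_transfers(1024, 16)   # 80386SX: 16 bits at a time
dx_transfers = bus_transfers(1024, 32)   # 80386/80486: 32 bits at a time
```

Under these assumptions the SX needs twice as many transfers as the
80386 or 80486 to move the same block of data.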

Another main factor is the chip's ability to quickly process instructions,
which involves the chip's architecture. As micro technology has improved,
manufacturers have been able to fit more transistors, and consequently
more functionality, onto microprocessor chips. This is evident in the
chip evolution listed in Table II.

Table II: Chip Evolution

Figure 1: Speed test flow chart

The Intel 80486 combines four formerly separate components into one chip:
the 80386 architecture, an 8K data cache, a cache controller, and an
80387DX-compatible math coprocessor. Intel also added Burst Mode
high-speed data transfer and a five-stage pipeline feature, which
processes up to five program instructions at once.

The internal data cache of the 486 is far superior to the external cache
used with the 386. When coupled with burst-mode data transfer, 128 bits of
data are transferred into the internal 8K cache from main memory (or an
external cache) with each CPU request. As a result, the 80486 uses fewer
clock cycles to move data than does the 80386.

The 486 can receive four 32-bit data blocks in only five to six clock
cycles. The most efficient 80386 chip transfers one 32-bit block every
two cycles. The 486 can handle multiple instructions at different stages
of completion, which further adds to its ability to complete operations
at its maximum rate of one transfer per clock cycle. These features allow
the 486 to leap far beyond the processing capabilities of its ancestors,
a feat that can be benchmarked through the performance speed rating in
NetWare 386.
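
The transfer figures quoted above can be compared directly. The sketch
below uses only the cycle counts given in the text; the variable names
are illustrative.

```python
BLOCKS = 4                  # four 32-bit blocks = 128 bits
CYCLES_PER_BLOCK_386 = 2    # one 32-bit block every two cycles
BURST_CYCLES_486 = (5, 6)   # five to six cycles for all four blocks

# Cycles the 80386 needs to move the same 128 bits.
cycles_386 = BLOCKS * CYCLES_PER_BLOCK_386

# Ratio of 80386 cycles to 80486 cycles for the best and worst burst.
speedups = [cycles_386 / cycles for cycles in BURST_CYCLES_486]
```

By these figures the 80386 needs eight cycles for the same 128 bits, so
the burst transfer is roughly one third to two thirds faster per cycle
before any difference in clock speed is considered.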

Clock Speed

The clock speed of most personal computers is determined by very precise
vibrations of a thin slice of quartz crystal. This crystal may be in a
metal package by itself on the CPU board, or it may be combined with
other circuits into an oscillator module. In either case, the crystal and
oscillator frequency is twice the speed at which the microprocessor
operates. The chip cuts the clock speed in half internally before using
it. In other words, an 80386 that operates at 16 MHz requires a system
clock that operates at 32 MHz.

Clock speed is measured in MHz: millions of cycles (or pulses) per
second. Therefore, a computer's clock counts time in nanoseconds or
billionths of a second. The throughput of a computer (how much
information it can actually process) is directly related to its clock
speed. Hence, a higher clock speed, coupled with the superior
architecture found in the 386 and 486 chips, increases the number of
instructions that may be performed in a given amount of time.
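
The relationships in this section reduce to a few conversions. The
helper names below are hypothetical, and the doubling rule for the
crystal applies only to the designs described above.

```python
def crystal_mhz(cpu_mhz):
    """Crystal/oscillator frequency for a CPU that halves its clock."""
    return 2 * cpu_mhz


def cycle_time_ns(cpu_mhz):
    """Length of one CPU clock cycle in nanoseconds."""
    return 1000.0 / cpu_mhz


def cycles_in(seconds, cpu_mhz):
    """Clock cycles available in a given interval."""
    return seconds * cpu_mhz * 1_000_000


crystal_16 = crystal_mhz(16)           # crystal for a 16 MHz 80386
cycle_16 = cycle_time_ns(16)           # one cycle at 16 MHz
test_cycles_16 = cycles_in(0.16, 16)   # cycles in a 0.16-second test
```

At 16 MHz one cycle lasts 62.5 nanoseconds, so the 0.16-second speed
test spans about 2.56 million clock cycles; doubling the clock speed
doubles the cycles available in the same interval.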

Memory

One of the primary functions of a microprocessor is the movement of data
to and from memory. The speed at which the system's memory chips operate
can affect the time that these transfers require. Because the speed
rating test manipulates data between memory locations and registers, the
system's memory architecture becomes a factor in determining system
speed.

Ideal Memory System

The ideal memory system is one in which the rate that memory can supply
information to the processor matches the rate the processor can execute
code. If memory is slower than the processor, the system is said to be
bus bound. If the processor is slower than memory, the system is
processor bound. Making one (processor or memory) faster than the other
will probably cost more, but will not improve performance.

Clock Cycles

If memory cannot respond fast enough to meet the processor's demand for
data, the processor must wait one or more clock cycles. The processor is
not necessarily inactive during these clock cycles, because it has
separate bus and execution units. While the bus unit is retrieving
information from memory, the execution unit can perform other operations,
such as register manipulation.

Each clock cycle that the processor has to wait is called a wait state.
Memory fast enough to respond to the CPU in two clock cycles is said to
operate at zero wait states. (A memory access normally requires a minimum
of two clock cycles, one for the microprocessor to send out an
instruction informing the memory system which bytes it wants to read, and
a second cycle for it to provide the contents of the location.)

Each additional wait state increases the memory access by one clock
cycle. Therefore a three cycle memory access would be synonymous with one
wait state (i.e., two cycles equal zero, three cycles equal one wait
state, and so on). Some 80386-based computers may, at times, have to
endure 2 or 3 wait states per memory cycle when accessing system board
memory, and sometimes 16 or more wait states when they read from memory
expansion boards. Wait states are necessary because most PC designs use
Dynamic Random Access Memory (DRAM) chips, which are significantly slower
but less expensive than static RAM (SRAM). For this reason, PC designers
also implemented memory cache.
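
The wait state arithmetic can be written out as follows. This is a
minimal sketch based on the two-cycle minimum described above; the
function names and the clock speeds used are illustrative.

```python
BASE_CYCLES = 2   # a zero-wait-state access takes two clock cycles


def access_cycles(wait_states):
    """Clock cycles for one memory access with the given wait states."""
    return BASE_CYCLES + wait_states


def access_time_ns(cpu_mhz, wait_states):
    """Time for one memory access, in nanoseconds."""
    return access_cycles(wait_states) * 1000.0 / cpu_mhz


zero_wait_16 = access_time_ns(16, 0)   # two cycles at 16 MHz
one_wait_16 = access_time_ns(16, 1)    # three cycles at 16 MHz
zero_wait_25 = access_time_ns(25, 0)   # two cycles at 25 MHz
```

At 16 MHz a zero-wait-state access spans 125 nanoseconds, while at
25 MHz the same two cycles span only 80 nanoseconds. This is why memory
that keeps up with a slower processor may force wait states on a faster
one.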

Memory Cache

There are two forms of cache used to improve system performance: memory
cache and disk cache. Memory cache increases RAM performance, while disk
cache improves disk efficiency.

The NetWare speed rating test exercises the memory cache of the file
server. (While disk caching can play an important role in a LAN
environment, it deviates from the main purpose of this AppNote, and will
be discussed in detail in future AppNotes.)

Memory cache decreases the time required to fetch the next instruction in
the program code. A memory cache architecture combines fast static random
access memory (SRAM) speed with cost effective DRAMs. It provides a small
amount (usually 32K but this figure can be larger) of fast SRAM (the
cache) that is logically located between the processor and main memory
(which is usually simple DRAM).

SRAM in the cache usually has an access time of 35 nsec to as low as 15
nsec, compared to the 80 to 120 nsec access times of the DRAM used in
main memory. This increased access speed allows swift CPUs to access data
in the cache at zero wait states.

Cache circuitry ensures that the portions of main memory that are most
often used are copied into the cache, making the majority of the memory
accesses to fast memory in the cache, and not to the slower main memory.

Whenever the processor attempts to read a memory location in a system
that uses a memory cache, the memory subsystem checks to see if the
contents of that location are stored in cache. If so, the data is
transferred from the cache at fast SRAM speed. This is referred to as a
cache hit. If the data is not in cache, the processor must wait until the
data can be transferred from slower main memory. This is called a cache
miss and, depending on the speed of the main memory DRAM and the speed of
the CPU, may inject many wait states into the system. During the wait
states, the contents of the location are also copied into cache, where
they can be accessed more quickly the next time they are needed.
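
The hit and miss behavior described above can be modeled with a very
small simulation. The sketch below assumes a direct-mapped cache holding
one word per slot, which is only one of several possible organizations;
the class name TinyCache and the sizes used are hypothetical.

```python
class TinyCache:
    """Direct-mapped, one-word-per-slot cache model (an assumption made
    for illustration; real cache controllers differ in organization)."""

    def __init__(self, main_memory, num_slots):
        self.main = main_memory
        self.num_slots = num_slots
        self.slots = {}      # slot index -> (address, value)
        self.hits = 0
        self.misses = 0

    def read(self, address):
        slot = address % self.num_slots
        entry = self.slots.get(slot)
        if entry is not None and entry[0] == address:
            self.hits += 1               # cache hit: served from SRAM
            return entry[1]
        self.misses += 1                 # cache miss: go to main memory
        value = self.main[address]
        self.slots[slot] = (address, value)   # copy into cache for reuse
        return value


memory = list(range(100, 164))           # 64 words of simulated DRAM
cache = TinyCache(memory, num_slots=8)
for _ in range(10):                      # a loop touching four locations
    for address in (0, 1, 2, 3):
        cache.read(address)
hit_rate = cache.hits / (cache.hits + cache.misses)
```

In this run only the first pass through the four locations misses; the
remaining nine passes are served entirely from the cache.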

A cache is made effective by the tendencies of most computer programs to
access the same few memory locations over and over, and to access
neighboring locations of those accessed recently. Once those few
locations have been loaded into cache, most accesses are made from the
cache, not from slower main memory. This increases system performance.
NetWare 386, like other well-developed programs, is designed with these
features, and benefits from this technology.

While more cache RAM generally helps, it reaches a point of diminishing
returns. Tests show that a 32K cache will achieve a 96 percent hit rate,
and that doubling or tripling the cache size only improves the hit rate
by one or two percentage points. In most cases the increased cost for the
extra cache memory is not worth such a marginal performance increase.
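
The diminishing return can be estimated with a weighted average of hit
and miss costs. The cycle counts below are assumptions chosen for
illustration (a two-cycle hit and a five-cycle miss, that is, three wait
states), not measured values.

```python
HIT_CYCLES = 2    # assumed: zero-wait-state access to cache SRAM
MISS_CYCLES = 5   # assumed: access to DRAM with three wait states


def average_access_cycles(hit_rate):
    """Average cycles per memory access for a given cache hit rate."""
    return hit_rate * HIT_CYCLES + (1.0 - hit_rate) * MISS_CYCLES


at_96 = average_access_cycles(0.96)   # 32K cache, per the text
at_98 = average_access_cycles(0.98)   # two points higher hit rate
gain = (at_96 - at_98) / at_96        # fractional improvement
```

Under these assumptions the average access falls from about 2.12 cycles
to about 2.06 cycles, an improvement of under three percent for double
or triple the cache memory.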

Conclusion

There are many interrelated factors that go into making a high-speed file
server. The CPU chip type, the clock speed, and the main memory and
memory cache
architecture are the most vital of these factors. The NetWare 386
Processor Speed test exercises all these components to formulate its
results. The information gleaned from such a test can be of use for
selecting the appropriate system and for evaluating a system's daily
operation. But this information is only one element in the overall
performance or throughput of a file server. There are other major
components that must be examined with the same scrutiny. For example, the
disk and communications channels should each be tested because, in either
case, an improper selection can cause even the fastest speed-rated
computers to function with less than optimum performance.

