SegMentor
Minimizing Swapping and Linking Overhead with SegMentor 
-----------------------------------------------------------------

Introduction

Mention segment map tuning to a programmer today and you'll usually get a
loud groan. Most of us detest such an assignment, knowing it to be a trial
and error process that is tedious, time consuming, and very frustrating.
Unfortunately, however, it is a job that typically cannot be ignored,
unless you don't mind a program that "thrashes" in low memory situations
-- using nearly all the computer's cycles on retrieving and linking
segments, and relatively few on actually executing functions. In such
environments good segmentation can improve overall application performance
by 2X to 10X. Even with plenty of memory, an untuned segment map can cause
from 5% to 10% overall performance degradation, due to the overhead of
far calls. This is especially true of larger, function-call-intensive
applications (such as those written in C++).

Dynamic linkers, virtual memory managers, and segmentation

Through the dynamic linking and virtual memory management (VM) systems in
Windows, applications that are larger than available memory can be run by
having a memory manager load code segments in from disk as needed at
runtime. Because these segments are dynamically linked into the running
programs, they can be loaded anywhere in memory. So the developer must no
longer decide what combinations of code segments can be resident at the
same time, as is the case when breaking up a program for overlays.

While VM and dynamic linking relieve the programmer of having to decide
where the code segments will reside, they do not address the issue of
optimizing the content of each code segment. So, if the developer is
concerned about performance, he is still left to determine by hand the
best allocation of functions to code segments.

NOTE: When loading a segment, virtual memory managers often need to
displace one or more resident segments to make room for the new one. VM
systems use least recently used (LRU) algorithms to calculate which
segment(s) should be removed. This reduces the chances of needing to
re-load and re-link a segment immediately after it is displaced. Windows
and OS/2 include this functionality; for DOS it is provided with add-ons
like .RTLinkVM or VROOMM from Borland. This type of optimization is
independent of, and complementary to, segmentation optimization. VROOMM
and other like technologies address which segment(s) should be displaced
to make room for a new one; SegMentor addresses which functions should be
allocated to the various segments.

How segmentation affects performance

Good segmentation groups functions into segments based on how frequently
the functions call each other. The goal is to minimize the number of
cross-segment function calls that occur as the application runs. Overall
application performance is improved in two ways by minimizing
cross-segment calls.

First, when a cross-segment call occurs, there is some probability (which
is a function of the amount of available memory) that the segment
containing the target function won't be in memory. If it is not, it will
have to be loaded from disk. Good segmentation, by reducing cross-segment
calls, reduces disk hits by a like percentage. Because disk hits are very
expensive (a call to a function on disk is about 20,000 times slower than
a call to a memory-resident function, measured on a 386 running Windows in
protected mode with 4K segments), the effect on overall application
performance can be dramatic.

Using a disk cache such as SMARTDrive is a help. But even when the target
function is cached in memory, the function call will still be more
expensive (2000X more so!) than calls to 'resident' functions. SMARTDrive
resident functions are already in memory, but they are not directly
available. To execute these functions, their segment must still be loaded
and linked into the running program. SMARTDrive saves a disk hit, but it
cannot eliminate the (expensive) loading and linking process. Good
segmentation will complement SMARTDrive and produce even larger
application speedups. Another benefit of a well-segmented application is
that the memory required for acceptable performance will be lower. A rule
of thumb is that a randomly or poorly segmented application requires
30-40% more memory than a well segmented application.

The second reason minimizing cross-segment calls improves performance is
that such calls are always far calls under DOS, Windows, and OS/2 1.x. Far
calls take longer to execute than calls within a
segment. An in-segment far call can be optimized by the linker so that it
takes about half as long to execute as a call to a function in another
memory resident segment. And an explicitly declared near call takes even
less time. So, if an entirely memory resident application spends 10% of
its time just making function calls, good segmentation can cut this time
in half, thus speeding up the application's overall performance by 5%.

(Note that the near/far call distinction goes away with the full 32 bit
linear addressing schemes in OS/2 2.0 and 32 bit versions of Windows like
NT. However even in these environments, applications will continue to
encounter low memory situations, forcing code segments (or pages) to be
loaded from disk. Thus, for optimum performance, paged environments still
require intelligent segmentation -- or intelligent placement/ordering of
functions within an executable file).

An unrealistic but illustrative example

Imagine an application has 5 functions: a, b1, b2, e1, e2. Each function
is about 2 Kbytes in size, and you want to segment the application into 4K
code segments (so two functions will fit in each segment, and the
application will have to be split into three segments). In a typical
runtime usage of the application, a calls b1 10 times, then a calls e1 10
times. Each time b1 is called, it calls b2 10 times, and each time e1 is
called, it calls e2 10 times. So a total of 220 function calls occur when
the application runs.

Let's explore the impact of several segmentation alternatives.

A worst case segmentation for this application would be to place b1 and e1
together in the same segment, b2 and e2 together, and a in a segment by
itself. In this case, all 220 function calls that occur will be
cross-segment calls rather than in-segment calls.

The best case segmentation for this application is to put b1 and b2
together, e1 and e2 together, and a by itself. In this case, there are 20
cross-segment function calls and 200 in-segment function calls.

If there were at least 12K of memory available on the machine (and no other
applications were running), all of the code segments would fit in memory
and no 'disk-hit' function calls would occur for either the worst case or
the best case segmentation. The worst-case application would run a little
slower though, because all of its function calls are far. If only 8K of
memory were available (so that only 2 code segments could be resident at a
time), the worst case solution would induce 40 disk-hits while the best
case solution would only induce 2.

Additionally, if only 4K were available, the worst case solution would
induce 440 disk hits (returns from functions in addition to calls can
require a code segment to be read from disk) while the best-case solution
would only induce 40.

When we look at the overhead incurred by the various types of calls, we see
why applications get incredibly slow in tight memory situations. These
figures (in microseconds) are estimates based on a 25MHz 386 running
Windows in protect mode with 4K segments:

near call:                 .3 
in-segment far call:       .6 
cross-segment far call:    1.4 
SMARTDrive resident call:  2,000
disk hit call:             22,000

As you can see, hits to functions on disk or in the SMARTDrive cache are
vastly more expensive than calls to functions within memory. Good
segmentation can produce huge gains in application performance by reducing
these types of calls.

Ramifications of segment size

A possible solution to the segmentation problem would be to have just one
function per segment. This would allow an application to run efficiently
in the minimum amount of memory. The application would make the best
possible use of memory and would have the fewest possible disk-hit calls.
However, there are two problems with this approach.

The principal problem is that Windows and OS/2 place a limit on the number
of segments allowed in an application (the limit for Windows is currently
253 segments per application). Therefore, this approach isn't feasible
for larger applications. Applications that are small enough to use this
scheme probably don't need to be very concerned about low
memory/segmentation issues because they are memory resident at all times
anyway.

The second problem is that assigning one function per segment forces all
calls to be far. For many applications, this won't have a very big impact
on performance, but for some (either because they're very time-critical or
very function-call intensive) this can be quite important.

Using very large segments is another alternative. Large segments have the
advantage of minimizing the number of far calls. However, as segments grow
larger, more time is required to read and dynamically link them into a
program. Also, as the segment size increases, the minimum memory required
by the application tends to increase.

In practice, a code segment size of 4K to 8K produces the best overall
results (it is not by chance that paged memory schemes have settled on 4K
as the fixed page size).

Summary of the segmentation problem

The problem, then, is to place an application's functions into segments in
a way that will minimize the number of cross-segment calls that occur when
the application runs.

The data needed to solve this problem is: a list of the functions in the
application; the size of each function; and finally the sequence of
function calls that occur when the application is run (in a typical
session).

While the problem is simple to state, the solution space that must be
searched is very large. For example, assuming you have to place n
equal-sized functions into m segments, there are on the order of m^n
possible solutions to consider (100 functions in 25 segments gives roughly
10^139 placements). This problem is probably NP-complete -- like the
traveling salesman problem of finding the shortest possible route that
visits N different cities.

Inadequate solutions

Leaving segmentation up to the linker is the default method. However, this
produces poor results because linkers don't have access to the runtime
function call history of an application, and thus cannot make intelligent
segmentation decisions.

To group functions into segments, linkers can make use of two pieces of
data. First, they can allocate functions based on how they are grouped in
their source code files. This is a reasonable idea, since functions in the
same file tend to call each other. Second, a linker can make use of the
'static' call-tree data on an application. A linker can know, for example,
that function b1 calls function b2 (what a linker can't know is how many
times b1 calls b2, and it also can't know about hidden function calls that
occur through function pointers, callbacks, etc.).

Generally, even making use of the above data, a linker's segmentation of
code is only slightly better than random segmentation. This is because the
'static' data that is available to the linker is simply inadequate.

Another option is to hand segment an application. This approach has three
problems.

First, you typically have the same problem the linker does: lack of
sufficient data. In particular, most DOS/Windows/OS/2 profilers do not
provide information on 'who calls who, and how many times'. They can only
provide this calling information if you specify which calls to trace --
and you can only trace a limited number of calls in one pass. For proper
segmentation, you need a comprehensive, application-wide call-tree summary
that represents a broad range of end-user type activity. This runtime
function call history is critical information for segmentation decisions.
Without it, you must make educated guesses about how strongly linked a
given set of functions are (that is, how frequently they call each other).
For small applications with relatively few functions, this can be a
manageable method. But with larger applications, an educated-guess
approach becomes impractical.

Second, even given the necessary data, finding a good solution by hand is
very difficult. What you find is that performance is improved in the area
you considered, but at the expense of slowing other parts of the program.
This is because when functions are moved to optimize one area of the
program, the change removes those same functions from other segments where
they may be frequently used.

When doing this job by hand it is impossible to know every implication of a
change in segmentation. Given a modestly sized application, a person
simply cannot consider all the ramifications of a single solution, let
alone compare that solution to others. Very simply, computers do a better
job of solving problems of this class.

Finally, hand solutions are labor intensive and require a senior engineer
with a thorough knowledge of the entire application. This makes such an
approach very, very expensive in both time and money. This is especially
the case when an application is modified fairly frequently, thus requiring
constant changes to its segmentation.

Why the runtime frequency of each function call is vital to optimal
segmentation

There are tools on the market that claim to compute optimal segmentation by
analyzing the "static" links between functions. In other words, they
calculate "optimal" segmentation by grouping functions into segments based
on which functions can call which other functions. Unfortunately, this
does not work.

While this static information is helpful for calculating segmentation, it
represents only part of the data required. The actual frequencies of each
function call, captured by tracing the application at runtime, are
absolutely vital to the process.

If it were possible to compute ideal segmentation without frequency
information, the compiler vendors, who have access to this static
information, would have already done it! But, because compilers do not
have access to the critical function call frequency data, they cannot
compute optimal segmentation, and thus do not attempt to do so.

A simple but illustrative database application with five functions
demonstrates why frequency data is crucial. The functions are main (the
main menu) and the four it calls:

open; close; read; and write

Given the nature of the application, the static (possible) function calls
would be:

   main calls open
   main calls close
   main calls read
   main calls write

Let's assume that there are only two segments in this application (one with
two functions and one with three functions). On the basis of the static
data shown above, how would you (or any algorithm you conceived) know
where to place each function? Strictly considering this information, you
would conclude that the solution has an equal chance of being any of the
following allocations. (Each solution binds main with two of the four
functions it calls.)

   SG1 main;open;close        SG2 read;write
   SG1 main;open;read         SG2 close;write
   SG1 main;open;write        SG2 close;read
   SG1 main;close;read        SG2 open;write
   SG1 main;close;write       SG2 open;read
   SG1 main;read;write        SG2 open;close

Now, let us consider the solution if we had access to the application's
runtime call history. When the program is exercised we learn that the
functions actually occur in this sequence.

     >main
       open
       read
       write
       read
       write
       read
       write
       read
       close
     >main

When consolidated into function calls and their respective frequencies, the
above call history is the equivalent of:

  Function call           Frequency
  main calls open              1
  main calls close             1
  main calls read              4
  main calls write             3

Based on this data it is overwhelmingly clear that main, read, and write
belong in segment 1 and open and close belong in segment 2. (This solution
had only a 17% likelihood of being selected had the runtime data not been
considered.)

SegMentor -- an automated solution to the segmentation problem

Capturing the critical data

The first step in using SegMentor is to build a database on the application
(.exe or .DLL) that is being segmented. The basic data needed is a list of
functions in the application, their sizes, and the runtime calling
relationships among the functions. But in order to collect the runtime
calling data (and in order to make some other things work) some secondary
data is also needed. This includes the name of each function's default
segment and source code file.

Data is drawn from three sources: the application's .map file, the
application's source code, and the generated 'log files' that contain the
application's runtime function call history.

.map file data:

The .map file provides the name of each function in the application
together with its (executable) size. It also provides the name of each
function's default segment (this is used in tracing the application's
function activity).

Unfortunately, functions declared as static don't show up in an
application's .map file. To get around this problem, a SegMentor utility
called fixup is used to change the static keyword (when used in reference
to functions) to a macro, such as LOCAL. Then, by defining this macro as
'nothing', it is possible to compile without static functions and thus
generate a full .map file (for normal compiles, the macro can be defined
as 'static').

Source code (.c file) data:

The application's source code points out those functions in the
application's .map file for which there is no corresponding source code.
This may be because they are from the standard library, or from
third-party libraries, or because they're written in assembly. In any
case, without source code there's no mechanism (in Microsoft C) for
assigning a function to a particular segment, so such functions are
ignored by SegMentor, and they wind up in default segments based on how
they're grouped into object files. (Note: SegMentor can trace and segment
assembly code, just not as conveniently as C.)

The source code also provides an un-mangled function name, a prototype for
each function, and the name of the source file in which each function is
found. This information is used in conjunction with the alloc_text pragma
when integrating a segmentation scheme back into the application. The
source code is also used in determining whether a function is far or near,
and whether it uses the Pascal or C calling convention. This information
is necessary to configure the tracing hooks properly.

Last, the source provides 'static' who-calls-who information. Though
frequency information is not available, the source code does describe all
the nonhidden function calls that occur in the application. This is used
to supplement the runtime function call history. If a given function
doesn't appear in the runtime history because the developer does not
exercise it for some reason, the static function-link information can then
be used to make reasonable segment-placement decisions for that function
(these static links are assumed to have a frequency of one).

Runtime function call (.log file) data:

To get a runtime function call history for an application, a 'traceable'
version of the application must be built. First, the application is
compiled with stack checking enabled. This causes a call to the chkstk
function to be placed at the beginning of each function in the
application. Then a special library called trace.lib is built using the
application's database and a SegMentor utility. trace.lib contains a
special version of chkstk which, in addition to performing chkstk's
other duties, logs function calls and function returns to a disk file.

How does MicroQuill's chkstk work?

Each time it is called it determines which application function called it
(and in that way, which application function was most recently invoked) by
examining the return address on the stack which corresponds to the call to
chkstk. The return address is correlated with information from the
application's database (which is coded into trace.lib) in order to
figure out which function made the call. Also, in order to trace function
returns, chkstk saves the return address that corresponds to the call to
the current application function, and replaces it with a call to a
MicroQuill routine called rtn_hook. That way, each time an application
function does a return, rtn_hook gets invoked. When rtn_hook is called, it
logs the function return to disk and then does a jump to the saved return
address.

Once a 'traceable' version of the application has been built, it can be
thoroughly exercised in order to generate a log file of runtime function
call activity. Ideally, each part of the application should be exercised.
Furthermore, it is best if each part of the application is exercised in
proportion to the percent of time that it's used. So if end users spend
75% of their time using option X, then realistically 75% of the log file
data should relate to option X (even if option X only represents 10% of
the application's code).

One option with SegMentor is to create separate log files for the different
parts of an application. Then, when a log file is incorporated into the
database, it can be weighted more heavily by having its data multiplied by
some value. For example, you might have the log file data for option X
weighted 7 times more heavily than it normally would be. (Another option
would be to create a single log file, and then just spend lots of time
exercising option X -- but in practice, this latter approach can be
awkward, as it might dramatically increase the amount of log file data
needed for good segmentation.)

For a large application, the log file(s) produced will typically be several
megabytes in size (2-30 bytes are needed to store each function call and
each function return).

Computing a new segmentation layout

Using SegMentor utilities, the above data is gathered into a database file.
Then a utility called segment searches for an optimal segmentation for the
application. In addition to the application's log database, segment
accepts several parameters, such as the target segment size that is
desired (4K is recommended). segment will crank away searching for
solutions for as long as desired; whenever it finds a solution better than
all previous ones, it is saved. For a large application, segment usually
finds a good solution in one or two hours. Running it overnight produces
the best results.

segment rates a given segmentation scheme by adding up the cross-segment
function calls that the scheme produces. The fewer the cross-segment
calls, the better. In our previous example, the best case segmentation
would be given a rating of 20, while the worst case segmentation would be
given a rating of 220. This rating function is optimum for reducing far
calls and also produces a minimum number of 'disk-hit' calls when the
application is run in a low-memory environment.

Note: The most accurate rating function for minimizing disk-hit calls would
be to simulate the sequence of calls that occur as the application runs,
assuming a given amount of available memory and an LRU swapping policy.
This would give a precise count of the number of disk-hit calls that a
given segmentation would produce. However, such a rating function is
impractical because it's so expensive to compute. And it turns out (not
surprisingly) that the number of cross-segment links is an excellent
predictor (given reasonable computation time) of the number of disk-hit
calls.

To start, segment makes an initial placement of functions into segments,
using a mini-max algorithm. In addition to making a reasonably good
initial placement, the mini-max algorithm produces a random initial
placement, in the sense that each initial placement is different. Once the
initial placement is made, segment begins a phase in which it randomly
moves functions around from one segment to another in an effort to improve
the segmentation's rating. In this phase, segment can either do 'iterative
improvement' or 'simulated annealing'.

In iterative improvement, only moves which lower (and thereby improve) the
segmentation's rating are accepted. segment accepts 'downhill' moves (that
is, moves which lower the segmentation's rating value) until no more
downhill moves can be found. At that point, it makes a new (and different)
initial placement and repeats the process.

With annealing, segment will initially accept uphill moves (moves that
raise the segmentation's rating value). For example, immediately after
making the initial placement, segment might accept a move that raises the
rating by 50. Over time, the size of the uphill move that segment accepts
is reduced, until it finally reaches 0. At this point it will no longer
accept uphill moves of any sort, and the process becomes one of iterative
improvement.

If you imagine the solution space to be a hilly two-dimensional landscape,
in which altitude corresponds to a segmentation's rating value, then with
iterative improvement segment can only drop to the bottom of the closest
valley or hole that the initial placement puts it in. With annealing,
segment can climb out of the hole to 'explore' neighboring holes. This
gives it the possibility of finding deeper holes. Both techniques work
well. Iterative improvement generally finds good solutions faster than
annealing. Given more time (usually 24+ hours on a fast CPU), annealing
tends to produce slightly better solutions.

Integrating the new layout back into the application

Once segment has been run on an application's database, all that is left to
do is to integrate the resulting segmentation scheme back into the
application. With C, the mechanism that is used to 'enforce' the
segmentation is the alloc_text pragma, which allows one to place a
function in a particular segment.

To do this with SegMentor, the utility makeinc is run. makeinc reads the
database and outputs a single include file called project.h. This include
file contains all of the necessary alloc_text pragmas for the
application. As part of the preparation for using SegMentor, a #include of
project.h must be added to each of the .c files that make up the
application (the fixup utility can add this at the same time it 'removes'
static functions from the application).
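For illustration, a generated project.h might look like this (the segment
and function names are invented; alloc_text is the Microsoft C pragma
described above, so this fragment is specific to that compiler):

```c
/* project.h -- generated by makeinc; do not edit by hand */
#pragma alloc_text(CODE_SEG1, main_menu, read_rec, write_rec)
#pragma alloc_text(CODE_SEG2, open_db, close_db)
#pragma alloc_text(CODE_SEG3, print_report, format_page)
```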

With project.h built, the application is re-compiled and re-linked (this
time without stack-checking and without the trace.lib file) and . . .
voila! A well-segmented application is the result.

NOTE: The tracing technique for the Zortech compiler utilizes prolog and
epilog data instead of the chkstk data. This version also uses a
post-compile object patching technique (as opposed to the alloc_text
pragma) to integrate the new segmentation. Work is ongoing to support the
allocation of functions to code pages in NT and other environments.

MicroQuill also sells SmartHeap, a professional memory management library
and debugging toolkit for Windows developers. If you have further
questions, please call MicroQuill at 206-525-8218, or Fax us at
206-525-8309.

MicroQuill Software Publishing Inc
4900 25th Ave NE, #206, Seattle, WA 98105

==========================================================
From the America Online -- New Product Information Service
==========================================================
This information was processed from data provided by the
above mentioned company. For additional details, contact 
the company at the address or telephone number indicated.
==========================================================
All submissions for this service should be addressed to:
BAKER ENTERPRISES, 20 Ferro Dr, Sewell, NJ 08080 U.S.A.
Email:  RBakerPC  (AOL),   rbakerpc@aol.com  (Internet)
==========================================================
