.nr PS 12
.nr VS 15
.TL
The Magic Lantern Distributed System Management System
.AU
Robbert van Renesse
.AI
Computer Science Department
Upson Hall
Cornell University
Ithaca, NY 14850
.AB
This document is terribly out-of-date, but you may read it if you wish.
.AE
.NH
Introduction
.PP
An often neglected aspect of distributed computer systems is their
management.
If the management of a monolithic computer system is hard, the
management of a distributed computer system is considerably harder.
Distributed computer systems provide many more ways to configure the
system itself and the applications running on it.
Distributed computer systems also have many more failure modes and
ways to recover.
Effective management can have dramatic effects on, among others, the
performance, robustness, and scalability of the system.
.PP
A popular and powerful trend in computer systems, and distributed
computer systems in particular, is to separate mechanism and policy.
As a result, many mechanisms have been developed for distributed
computer systems:  mechanisms for computation, communication, storage,
access protection, fault tolerance, concurrency control, and
accounting.
Relatively little effort has been devoted to policy, and most of the
work is theoretical work on simplified computer models with often
unrealistic assumptions.
.PP
Policy decisions for real computer systems include the scheduling and
placement of processes, the placement of replicas on storage servers,
network routing, how to recover from failures, and trade-offs between,
say, consistency and performance.
These decisions are not easily captured in an algorithm that will
perform well under unexpected circumstances.
And unexpected circumstances are to be expected, at which point the
system may start thrashing, or display other kinds of unwanted
behavior.
An example of such a circumstance, and one that occurs often, is when
a system or application is used on a much larger scale than it was
designed for.
.PP
In the event of a problem, fast, often manual intervention is
necessary to keep the system from coming to a grinding halt.
But even when there is no problem, continuous adaptations to policies
are required to provide optimal performance and to reduce the
probability of a real problem.
It is this that we call management.
.PP
Management has a passive and an active side.
The passive side monitors the system, detects errors, and locates
bottlenecks as well as under-used parts of the system, all while trying
to have a minimal influence on the system itself.
The active side exercises control over the system, and is able to
change the configuration, to disable, enable, or exchange system
components, or to update system parameters.
.PP
Management has an evolving character.
At first most control will be manual.
When the manager gets the ``feel'' of the system, and knows the
effects of the different parameters on the system, he or she may
automate this part of management by implementing a heuristic.
But at any time unexpected system behavior may need manual
intervention.
.PP
This document describes a powerful tool providing management mechanisms
that can be invoked manually or used to implement policies.
.NH
Model for Distribution
.PP
Before describing the design of the management system, we will look at
what forms a distributed system can assume.
In this section we will develop a system-independent abstraction of
distributed systems.
Later we will need this abstraction for the design of the management
system.
.PP
A distributed system is a collection of functionally or physically
separated components, cooperating in some way on some task or
application.
To accomplish this task, the components usually communicate or
synchronize by means of a shared medium.
To keep our management tool versatile, we will not further specify the
extent of distributed systems.
The components need not be, or run on, physically different computers.
Neither will we prescribe how the components communicate.
Examples of distributed systems are distributed computer systems,
distributed data bases, parallel numerical processes, automated
factories, but also airplanes and modern cars.
.PP
The components of a system can be small or large.
Components may be as small as single two-valued bits, or as large as,
say, a storage service in a distributed operating system.
The components may be software, hardware, or both.
Examples of hardware components are valves, electro-magnets, or
motors.
The components may be distributed systems themselves, such as a
replicated storage service, which consists of physically separated
storage servers.
.PP
What exactly the components of a system are depends on what we want to
manage.
The same system may be subdivided in different ways.
For example, a distributed operating system may be viewed as a set of
computers connected by a set of networks and bridges.
Alternatively, the same system may be viewed more functionally as a
set consisting of a storage service and workstations communicating by
messages.
We will call the way in which the manager subdivides the system into
components the
.I view
on the system.
.PP
Given the components, there may be several ways to connect the
components into a system.
The connections determine how the components communicate or
synchronize, and this may affect the system's performance and
security.
We will call the way in which the components are connected the
.I configuration
of the view of the system.
The system's robustness and scalability depend heavily on how the
manager is able to change the configuration.
.NH
Model for Management
.PP
If we wish to make a versatile tool for management, we will have to
provide an abstraction of the components, a uniform interface to
monitor and control them, and a way to describe the configuration.
We will deal with the abstraction in this section, leaving the
interface and configuration description for the next two sections.
.PP
Depending on the component, there may be few or many characteristics
that we want to measure or influence.
The devices that measure characteristics are called
.I sensors ,
and the devices that try to change characteristics are called
.I actuators .
The abstraction of a component consists of being able to access sensors
and actuators only, leaving the rest of the component as a black box.
.PP
Both sensors and actuators are numerical devices, meaning that their
state can be captured in a single numerical value.
We do not need ``single-shot'' types of sensors and actuators.
A sensor that detects events like, say, an electron passing in a
cyclotron, can simply be transformed into a numerical sensor that
counts the number of electrons that have passed.
Similarly, an actuator that, say, fires an electron, can be modeled as
incrementing the value of the actuator.
.PP
Some sensors or actuators are immediately provided by the components,
such as the cache size in a file server.
Others, like the hit ratio of the cache, have to be added or derived
in another way.
The hit ratio can be derived from the values of a sensor that counts the
number of hits, and another sensor that returns the time.
The hit ratio can be influenced (controlled is too rigid a term here)
by changing the size of the cache.
.PP
Sensors can be derived from other sensors in a multitude of ways.
First are simple numerical operations on sensors, such as adding or
dividing.
Boolean operations are also common, for example in a sensor that
detects when the hit ratio of a cache goes below a certain limit.
It is also possible to define ``reliable sensors'' out of a set of
identical, less reliable sensors.
Other derivations often involve time intervals, such as a sensor that
computes the average hit ratio over the last five minutes.
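.PP
The derivations above can be sketched in a few lines of Python.
This is only an illustration, not part of MAGIC LANTERN: all names are
hypothetical, the ratio is taken over an assumed lookup counter rather
than over time, and the median is just one possible way to build a
``reliable'' sensor out of redundant ones.
.DS
```python
from collections import deque
from statistics import median


def ratio(hits, lookups):
    """Numerical derivation: hit ratio from two counter sensors."""
    return hits / lookups if lookups else 0.0


def below_limit(value, limit):
    """Boolean derivation: 1 when the value is below the limit, else 0."""
    return 1 if value < limit else 0


def reliable(readings):
    """Combine identical, less reliable sensors by taking the median."""
    return median(readings)


class WindowAverage:
    """Average of the samples taken during the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.samples = deque()  # (time, value) pairs, oldest first

    def add(self, now, value):
        self.samples.append((now, value))
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window:
            self.samples.popleft()

    def value(self):
        if not self.samples:
            return None
        return sum(v for _, v in self.samples) / len(self.samples)
```
.DE
.LP
A five-minute average corresponds to WindowAverage(300), fed with one
sample each time the underlying sensor is read.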
.PP
Actuators are more complex, and often involve feedback from sensors.
For example, a temperature regulator will activate a heating element
if the temperature sensor is below the prescribed value, and vice
versa.
An actuator for the hit ratio of a cache will increase the cache size
and watch the hit ratio sensor.
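.PP
A feedback actuator of this kind might look as follows.
The sketch assumes two hypothetical callbacks, one reading the hit-ratio
sensor and one setting the cache size; the step size and the upper
bound are arbitrary illustration values.
.DS
```python
def regulate_hit_ratio(read_hit_ratio, set_cache_size, size, target,
                       step=64, max_size=4096):
    """Grow the cache until the hit-ratio sensor reaches the target.

    read_hit_ratio and set_cache_size are hypothetical callbacks that
    stand in for a real sensor and a real actuator.
    """
    while read_hit_ratio() < target and size + step <= max_size:
        size += step
        set_cache_size(size)
    return size
```
.DE
.LP
The upper bound matters: without it the loop would keep growing the
cache when the target cannot be reached at all.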
.PP
Actuators often have delayed effects, and worse, several side-effects.
These side-effects may even occur in other components, possibly of
different applications.
For example, increasing the cache size will reduce the amount of memory
left for other applications.
Actuators are usually harder to design and implement than sensors.
For example, having defined the performance of an application, it is
usually simple enough to make a sensor for it, but it is usually
unclear how to maximize the application's performance.
.PP
We call the collection of sensors and actuators of a component a
.I station .
The station names the sensors and actuators, and provides the interface
and all other monitoring and control functionality that is not part of
the component itself.
Stations can communicate with each other to allow exchanges of
monitoring and control information.
.PP
To minimize the effect of monitoring on components, stations will run
close to their associated components, and deal with as many of the
management issues as possible.
The stations, as it were, are filters, processing as much of the
monitoring data as possible.
The set of stations is therefore usually configured in the same way
as the system itself, and forms a distributed application.
Consequently, the management system will have to be able to manage
itself.
.NH
Design
.PP
The design of the MAGIC LANTERN management tool is based on stations.
The concept of stations is extended to allow hierarchical composition.
Thus components can be combined into a new component with its own
management station.
Consequently, the design of the complete management system is the
design of a single station.
.PP
A MAGIC LANTERN station features the following:
.RS
.IP \(bu
naming and access of sensors and actuators;
.IP \(bu
a data base that maintains the history of the sensor and
actuator values;
.IP \(bu
function evaluation;
.IP \(bu
communication to other stations;
.IP \(bu
event detectors that take action on certain events;
.IP \(bu
a visual user interface to this data base;
.IP \(bu
data base templates to create new data base structures.
.RE
.LP
Manipulating the data base triggers actuators automatically.
The data base can be manipulated directly through the visual
interface.
.NH 2
The Historical Data Base
.PP
The MAGIC LANTERN data base is organized in
.I records .
Each record may contain an arbitrary collection of
.I attributes .
The value of an attribute depends on a
.I "sequence number" .
The sequence number is used to allow retrieval of old values.
A value is a function of other values in the data base, and returns
either an ASCII string or null.
Usually values are constant functions, independent of the rest of the
data base.
.PP
Associated with the data base is the
.I "current sequence number" .
Values updated in the data base are stored under the current sequence
number.
By increasing the current sequence number, the changes in the data
base are committed atomically.
It is not possible to update the history of the data base.
.PP
Records and attributes are named by ASCII strings.
Thus string values can be retrieved by specifying two strings and a
sequence number.
The current sequence number is globally available to allow retrieving
the current value of an attribute.
.PP
The interface to the data base is the following:
.DS
STORE ( RECORD, EXPRESSION, FUNCTION )
RETRIEVE ( RECORD, ATTRIBUTE, SEQNO ) \(-> ( VALUE, SEQNO' )
CURRENT ( ) \(-> SEQNO
COMMIT ( )
CHECKPOINT ( )
.DE
.LP
STORE() updates the function of an attribute, or possibly of a set of
attributes.
RECORD specifies the string name of the record, and EXPRESSION the
set of attributes.
What exactly can be expressed with EXPRESSION is currently left to the
implementation, but should include at least literal specification of
attributes.
.PP
FUNCTION is invoked each time a corresponding attribute is retrieved.
It gets the same arguments as the corresponding RETRIEVE(), and
returns a string value and a sequence number which are subsequently
returned by RETRIEVE().
SEQNO specifies where to look in history.
To retrieve the current value of the attribute, SEQNO is set to the
current sequence number as returned by CURRENT().
The sequence number SEQNO' is the sequence number at which the attribute
obtained the corresponding value.
.PP
FUNCTION is often a constant string, which can be returned by
RETRIEVE() immediately without further evaluation.
Another common function is a so-called
.I "link function" ,
which returns the value of another attribute of possibly another
record.
Initially all attributes are assigned the constant null function, such
that retrieval of unassigned attributes results in null values.
In the next section we discuss functions in greater detail.
.PP
COMMIT() is used to increment the sequence number if any changes were
made to the data base since the last commit.
If not, COMMIT() does not do anything.
If changes were made, they may possibly be logged for crash recovery,
but this depends on the implementation.
.PP
CHECKPOINT() writes the whole data base to a stable file, and, if
necessary, truncates the log file.
CHECKPOINT() is an expensive operation which should not be invoked
very often.
However, not invoking it at all may cause all information to be lost
if no log file is maintained, or may cause the log file to grow too
large if it is.
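.PP
To make the interface concrete, the following Python sketch models it
in memory.
It is an illustration under several assumptions, not the MAGIC LANTERN
implementation: EXPRESSION is reduced to a literal list of attribute
names, a function is either a constant string or a Python callable,
unassigned attributes report sequence number 0, and logging and
CHECKPOINT() are left out.
.DS
```python
class HistoricalDB:
    """Toy in-memory model of the historical data base."""

    def __init__(self):
        self.seqno = 0        # current sequence number
        self.history = {}     # (record, attribute) -> [(seqno, function)]
        self.dirty = False    # any STORE since the last COMMIT?

    def current(self):
        return self.seqno

    def store(self, record, attributes, function):
        # `attributes` plays the role of EXPRESSION (assumed: a list).
        for attribute in attributes:
            versions = self.history.setdefault((record, attribute), [])
            if versions and versions[-1][0] == self.seqno:
                versions[-1] = (self.seqno, function)  # not yet committed
            else:
                versions.append((self.seqno, function))
        self.dirty = True

    def retrieve(self, record, attribute, seqno):
        # Find the newest version stored at or before `seqno`.
        versions = self.history.get((record, attribute), [])
        for stored_at, function in reversed(versions):
            if stored_at <= seqno:
                if callable(function):
                    return function(self, record, attribute, seqno)
                return function, stored_at     # constant function
        return None, 0                         # constant null function

    def commit(self):
        # Advance the sequence number only if something changed.
        if self.dirty:
            self.seqno += 1
            self.dirty = False


def link(record, attribute):
    """Link function: returns the value of another attribute."""
    return lambda db, _r, _a, seqno: db.retrieve(record, attribute, seqno)
```
.DE
.LP
Because old versions are never overwritten once the sequence number has
advanced, retrieving with an older SEQNO returns the value as it was at
that point in history.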
.PP
The data base can be used to store the values of sensors and
actuators, the configuration of a system, the contents of the visual
display, information about communication between modules, and other
things like, for example, a data dictionary.
The data dictionary is provided by a set of special functions.
.PP
Data in the data base is represented by string values.
If other types of data need to be stored, they have to be encoded in some
way.
Integers and floating point numbers are easily and portably
represented.
A record type is directly representable as a data base record.
A list can be stored in a record, with the attributes serving as
indexes.
For example, a list of ten elements could be stored in a record with
attributes ``0'' through ``9.''
.PP
One of the most frequently used structures is the table structure.
In the MAGIC LANTERN system a table is built from a list of record names.
Each record represents a row in the table.
Associated with each of the columns is an attribute in the record.
More complex structures such as graphs can be formed by storing the name
of a record in an attribute of another.
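.PP
As an illustration of these encodings, the sketch below stores a list
and a table in a plain Python dictionary that stands in for the data
base.
The function names are hypothetical; the ``N'' attribute holding the
element count follows the convention used by the $list operator.
.DS
```python
def store_list(db, record, items):
    """Store a list in `record`: count in "N", elements in "0".."N-1"."""
    db.setdefault(record, {})["N"] = str(len(items))
    for index, item in enumerate(items):
        db[record][str(index)] = str(item)


def load_list(db, record):
    count = int(db[record]["N"])
    return [db[record][str(index)] for index in range(count)]


def column(db, table, attribute):
    """A table is a list of record names; each record is one row."""
    return [db[row].get(attribute) for row in load_list(db, table)]
```
.DE
.LP
All values are kept as strings, matching the rule that data in the data
base is represented by string values.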
.PP
As time may also be provided as a value in the data base, we can match
sequence numbers to real time.
This way it is possible to see at what time an attribute changed its
value.
.NH 2
Functions
.PP
The MAGIC LANTERN distributed management system supports a postfix expression
evaluator.
In postfix notation, operators appear after the values they operate
on.
For example, ``2 4 +'' is a postfix expression that results in ``6.''
We represent a postfix expression by a list of so-called
.I tokens .
In our previous example, the tokens are ``2,'' ``4,'' and ``+.''
.PP
The expressions are evaluated strictly from left to right using a
stack.
Tokens that are not operators are pushed onto the stack.
Operator tokens replace zero or more tokens from the top of the stack
by an arbitrary number of other tokens.
For example, ``+'' replaces two tokens from the stack by a single
token containing the sum.
Operators may be quoted to remove their special meaning, so that they
may be pushed onto the stack unevaluated.
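.PP
The stack discipline can be sketched as follows.
This is a minimal illustration restricted to four arithmetic operators;
the tokenizer and the function names are assumptions, and operands are
taken in the conventional order (the token pushed first is the left
operand).
.DS
```python
import re

OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def tokenize(text):
    """Split on blanks, keeping double-quoted groups as single tokens."""
    return re.findall(r'"[^"]*"|[^ ]+', text)


def evaluate(text):
    stack = []
    for token in tokenize(text):
        if token.startswith('"'):
            stack.append(token[1:-1])        # quoted: push unevaluated
        elif token in OPERATORS:
            right = float(stack.pop())
            left = float(stack.pop())
            stack.append(OPERATORS[token](left, right))
        else:
            stack.append(token)              # plain value
    return stack
```
.DE
.LP
With this sketch, evaluating ``2 4 +'' leaves the single value 6.0 on
the stack, while a quoted ``+'' is left on the stack as an ordinary
token.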
.PP
The basic operators are listed in Appendix A.
They are divided into mathematical operators, boolean operators,
string operators, special operators, and general operators.
Mathematical operators include operations like the sine of a value.
Boolean operators like ``<'' and ``or'' represent
.I false
by zero, and
.I true
by non-zero values.
.PP
String operators apply to ASCII strings.
An example of such an operation is string comparison.
Special operators have specialized functions.
An example of a special operator is ``#'', which takes a record and an
attribute as input tokens, and produces the value that is retrieved
from the data base as output token.
.PP
General operators apply to the postfix machine.
Examples are discarding or duplicating tokens on the top of the stack.
Two operators deserve special attention, as they can be used to
operate on lists and tables.
.PP
The first one is the ``$list'' operator.
This operator takes a record name and an operator as input.
The record contains a list represented by an attribute ``N'' containing
the number of elements, and the elements themselves in attributes 0
through N - 1.
$list then applies the operator to each element in the list, and
finally pushes the number of elements N.
The operator can access the element through the $entry operator.
.PP
The second important operator is $iter.
This operator takes a number N and an operator as input tokens, and
applies the operator N times.
For example, say that we have a table called ``T,'' and we wish to
compute the sum of the squares in the column named ``age.''
This is computed by the following function:
.DS
T "$entry age # $dup *" $list 1 - "+" $iter
.DE
.LP
The $list operator produces the squares of all the ``age'' elements of
T on the stack, and the number of elements N.
The ``-'' operator subtracts one from this number, since we need to
apply the ``+'' operator only N - 1 times.
$iter then produces the sum.
Note that the ``+'' operator has been quoted to prevent it from being
evaluated immediately.
.PP
Users can store these functions in the data base by quoting them.
Other functions can invoke quoted functions in the data base using the
``#'' and ``$eval'' operators.
$eval evaluates the token on top of the stack.
For example, ``x y # $eval'' retrieves the function that is stored in
the attribute called y in the record called x, and evaluates it.
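.PP
The behavior of ``#'', $dup, $list, $iter, and $eval described in this
section can be imitated by a small interpreter.
The Python sketch below is an approximation for illustration only: the
data base is a dictionary of dictionaries of strings, only three
arithmetic operators are supported, and details such as the operand
order are assumptions.
.DS
```python
import re


def run(text, db, stack=None, entry=None):
    """Evaluate a postfix expression against db (dict of dicts)."""
    stack = [] if stack is None else stack
    for token in re.findall(r'"[^"]*"|[^ ]+', text):
        if token.startswith('"'):
            stack.append(token[1:-1])            # quoted: push as is
        elif token == "$entry":
            stack.append(entry)                  # current $list element
        elif token == "#":
            attribute, record = stack.pop(), stack.pop()
            stack.append(db[record][attribute])
        elif token == "$dup":
            stack.append(stack[-1])
        elif token in ("+", "-", "*"):
            right, left = float(stack.pop()), float(stack.pop())
            stack.append({"+": left + right, "-": left - right,
                          "*": left * right}[token])
        elif token == "$list":
            operator, record = stack.pop(), stack.pop()
            count = int(db[record]["N"])
            for index in range(count):
                run(operator, db, stack, db[record][str(index)])
            stack.append(count)
        elif token == "$iter":
            operator, times = stack.pop(), int(stack.pop())
            for _ in range(times):
                run(operator, db, stack, entry)
        elif token == "$eval":
            run(stack.pop(), db, stack, entry)
        else:
            stack.append(token)                  # plain value
    return stack
```
.DE
.LP
For a table T with two rows whose ``age'' attributes are 30 and 41, the
sum-of-squares function shown earlier leaves 2581.0 on the stack of
this sketch.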
.NH 2
Communication
.PP
Communication between stations is by asynchronous request/response
exchanges.
A request or response message is a list of ASCII strings.
The first two elements of a message are defined.
The first element is the name of the source station of the message,
and the second element its sequence number at the time it sent the
message.
In case of a request message, the third element contains the name of a
function to be invoked in the destination station.
.PP
The communication interface comprises the following two procedures:
.DS
SUBSCRIBE ( COMMAND, REQUEST-PROCEDURE )
SEND-REQUEST ( STATION, MESSAGE, REPLY-PROCEDURE )
.DE
.LP
SUBSCRIBE() specifies that when a request message is received with
COMMAND as its third element, the REQUEST-PROCEDURE is to be invoked
with the request message as its argument.
REQUEST-PROCEDURE returns a reply message which is subsequently sent.
.PP
SEND-REQUEST() sends MESSAGE to STATION.
When a reply message is received, the REPLY-PROCEDURE is invoked with
the reply message as its argument.
If STATION does not exist, or crashes during the handling of the
request, the REPLY-PROCEDURE is invoked with a null reply message.
The reply procedure does not return any value.
.PP
Both request and reply procedures run uninterruptedly to completion.
At the end of these procedures, COMMIT() is invoked automatically,
atomically committing changes to the data base.
At the start of these procedures, the source station and its sequence
number can be stored in the data base, thus maintaining a partial (or
causal) order between the event histories of the different stations.
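.PP
The message layout and the two procedures can be sketched as follows.
This is a toy model under explicit assumptions: delivery is a direct
in-process call instead of an asynchronous network exchange, the
sequence number is bumped after every procedure as if each one changed
the data base, and all class and attribute names are hypothetical.
.DS
```python
class Station:
    """Toy station; delivery is an in-process call, not a network."""

    registry = {}                      # station name -> Station

    def __init__(self, name):
        self.name = name
        self.seqno = 0
        self.handlers = {}             # command -> request procedure
        self.seen = []                 # (source, source seqno) pairs
        Station.registry[name] = self

    def subscribe(self, command, request_procedure):
        self.handlers[command] = request_procedure

    def send_request(self, station, arguments, reply_procedure):
        # Elements 1 and 2: source station and its sequence number.
        message = [self.name, str(self.seqno)] + list(arguments)
        target = Station.registry.get(station)
        reply = target.handle(message) if target else None
        if reply is not None:
            self.seen.append((reply[0], int(reply[1])))
        reply_procedure(reply)         # null reply if station is unknown
        self.seqno += 1                # stands in for automatic COMMIT()

    def handle(self, message):
        source, source_seqno = message[0], int(message[1])
        self.seen.append((source, source_seqno))  # record causal order
        handler = self.handlers.get(message[2])   # element 3: command
        if handler is None:
            return None
        reply = [self.name, str(self.seqno)] + list(handler(message))
        self.seqno += 1                # stands in for automatic COMMIT()
        return reply
```
.DE
.LP
The list kept in ``seen'' is what allows a station to relate its own
history to the sequence numbers of the stations it has heard from.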
.NH 2
Event Detection
.PP
An
.I event
is a changing value in the data base.
It is possible to set a
.I watch
on an attribute in the data base.
Each time the attribute changes, the watch sends a request message to a
specified station.
Note that it is not necessary to update the attribute directly to
trigger the watch, as attributes are bound to functions.
However, watches can only be triggered by changes in the data base.
.PP
The routine to set a watch is:
.DS
WATCH ( RECORD, ATTRIBUTE, STATION )
.DE
.LP
Each time ATTRIBUTE in RECORD changes, a message containing RECORD,
ATTRIBUTE, the new value, and, as always in messages, the source
station and its sequence number is sent to STATION.
.PP
By setting watches, stations can keep copies of values in multiple
data bases.
Often the station specified in a watch is the station that holds the
record itself.
This way a request message task is started each time the attribute
changes, so that the station can take appropriate control actions.
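.PP
A watch can be modeled as a list of destination stations attached to an
attribute.
The Python sketch below is illustrative only: the class, the send
callback, and the exact order of the message elements after the source
station and sequence number are assumptions, and values are stored
directly instead of as functions.
.DS
```python
class WatchedDB:
    """Toy attribute store that reports changes to watching stations."""

    def __init__(self, station, send):
        self.station = station     # name of the station owning the data
        self.send = send           # hypothetical send(destination, message)
        self.seqno = 0
        self.values = {}           # (record, attribute) -> value
        self.watches = {}          # (record, attribute) -> [stations]

    def watch(self, record, attribute, station):
        self.watches.setdefault((record, attribute), []).append(station)

    def store(self, record, attribute, value):
        key = (record, attribute)
        if self.values.get(key) == value:
            return                 # no change, so no event
        self.values[key] = value
        for destination in self.watches.get(key, []):
            self.send(destination, [self.station, str(self.seqno),
                                    record, attribute, value])
```
.DE
.LP
Storing the same value twice sends only one message, since an event is
defined as a changing value.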
.PP
For efficiency it may be sufficient if a watch reports updates at
most, say, once every ten seconds, or only if the value of the attribute
has changed by more than a specific amount.
To implement this, the user first defines a new attribute with those
characteristics, based on the attribute to watch.
Then the user puts a watch on the new attribute.
.PP
For example, it is easy to define a function on an attribute and the
time (in seconds), which, at time T, returns the value of the
attribute at time 10 * (T div 10).
By assigning this function to a new attribute, we have created a
sensor that changes value at most once every 10 seconds.
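.PP
Such a function can be sketched in Python as follows.
The history is assumed to be available as a time-sorted list of
(time, value) pairs; that representation and the function name are
hypothetical.
.DS
```python
import bisect


def sampled(history, now, period=10):
    """Value the attribute had at time period * (now div period).

    history is a list of (time, value) pairs sorted by time.
    """
    tick = period * (now // period)
    times = [time for time, _ in history]
    index = bisect.bisect_right(times, tick) - 1
    return history[index][1] if index >= 0 else None
```
.DE
.LP
Between two multiples of the period the result cannot change, however
often the underlying attribute does.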
.NH 2
Visual Interface
.PP
An important part of a MAGIC LANTERN station is its (optional) visual user
interface.
This interface provides a user-friendly way to display and update the
data base, and, as the data base represents sensors and actuators,
convenient access to monitoring and control of the station.
.PP
The interface is an object-oriented graphical ``toolkit.''
The toolkit supports the creation of
.I widgets ,
which represent ports of the data base.
Examples of widgets are graphs, dials and buttons.
All widget-related information, such as position, size, and where to
get or put relevant data, is described in the data base itself.
.PP
The interface defines a relatively small set of standard widgets.
These include widgets for drawing lines, displaying text, the hands of
a dial, buttons, scroll bars, graphs, and menus.
By editing the text widgets, or manipulating buttons or scroll bars
with the mouse, the data base can be updated directly.
.PP
Another important widget is the
.I "picture widget" .
The picture widget is a compound widget, consisting of a set of other
widgets.
The widgets within are automatically positioned and scaled relative to
the picture widget's position and size.
A picture widget may include itself, creating an effect like a camera
viewing its own monitor.
.PP
For example, we can build a clock widget as follows.
First we create a record with attributes for the angles of the hour,
minute, and second hand.
The attributes are relatively simple functions of time.
Next we build the widget out of widgets displaying the lines and text
on the plate, and the three hands.
The hands get their data from the newly defined attributes.
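.PP
The ``relatively simple functions of time'' for the three hands could
look like this in Python.
The sketch assumes that time is given as seconds since midnight and
that angles are measured in degrees, clockwise from the twelve o'clock
position; both choices are illustrative.
.DS
```python
def hand_angles(seconds):
    """Angles in degrees, clockwise from 12 o'clock, for a time of day."""
    second = seconds % 60
    minute = (seconds / 60) % 60
    hour = (seconds / 3600) % 12
    return {"hour": hour * 30, "minute": minute * 6, "second": second * 6}
```
.DE
.LP
In a station these three values would be attributes of the clock
record, recomputed whenever the time value changes.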
.PP
The window displays the top-level widget, which is a picture
widget.
Because the specifics of a widget are stored in the data base, the
toolkit has a number of powerful features not commonly found in other
toolkits.
For example, because attributes of widgets, like position and size,
are stored as functions in the data base, the toolkit supports moving
and growing widgets.
Furthermore, widgets can be directly edited on the screen at run-time,
eliminating the need for picture editors or complex code to lay out
the window contents.
Another consequence is that information is stored independently of the
graphical interface, so that different windows and other graphical
systems are supported.
A user could use, say, X11 windows for interactive work, and
PostScript to print the data on paper.
.PP
An important feature of the user interface is the history bar.
This is a specific widget which is tied to a sequence number.
By manipulating the history bar with the mouse, we can display the
state of the data base at earlier points in its history, and see the
sequence of events that led to the current situation.
.NH 2
Templates
.PP
To be provided.
.NH
Station Composition
.PP
A system, or, more precisely, a system's view, may be built up from
several components.
The components, in turn, may consist of multiple other components.
The components are tied together according to some configuration.
We can speak of parent, child, and sibling components according to
their relation in the component family tree.
.PP
Each component, on any level, has its own station that monitors and
controls it.
Specifically, at the top level of the family tree, there is a station
that manages the complete system.
The stations describe the configuration of their child components, if
any, in their data bases.
The children ``watch'' the relevant configuration data as stored in
their parents.
When this data is changed, the children may have to take appropriate
action, such as terminating, or talking to other components.
.PP
Each station supports a set of standard request messages to simplify
composition.
They are
.RS
.IP \(bu
RETRIEVE
.IP \(bu
STORE
.IP \(bu
CHECKPOINT
.IP \(bu
WATCH
.RE
.LP
The RETRIEVE and STORE request messages allow retrieval and updates of
data bases in remote stations.
STORE can also be used to terminate a station, by setting the ``status''
attribute of the station's ``control'' record to ``stop.''
Each station watches this attribute and cleans up if necessary.
.PP
The CHECKPOINT message requests a station to checkpoint its data base,
and is typically sent by a parent station to its children.
The WATCH request message tells a remote station to set a watch on a
certain attribute and report events back.
.bp
.NH
Appendix A
.LP
This appendix lists the operators along with their number of arguments.
In their description the arguments on top of the stack are denoted by
$1, $2, ... .
.de OP
.ta 0.3i 1.5i 2i
.br
	\\$2	\\$1	\\$3
..
.NH 2
Mathematical Operators
.LP
.OP 1 $sign "sign of $1 (-1, 0, or 1)"
.OP 1 $abs "absolute value of $1"
.OP 1 $neg "\- $1"
.OP 2 $min "minimum of $1 and $2"
.OP 2 $max "maximum of $1 and $2"
.OP 2 +  "$1 + $2"
.OP 2 -  "$1 - $2"
.OP 2 *  "$1 * $2"
.OP 2 /  "$1 / $2"
.OP 2 $div "integer divide: $1 / $2"
.OP 2 $mod "integer modulo: $1 % $2"
.OP 2 %  "ditto"
.OP 1 $round "round to nearest integer value"
.OP 1 $trunc "round down"
.OP 2 $pow "$1 ^ $2"
.OP 1 $exp "exp($1)"
.OP 1 $log "log($1)"
.OP 1 $sin "sin($1)"
.OP 1 $cos "cos($1)"
.OP 1 $tan "tan($1)"
.OP 1 $asin "asin($1)"
.OP 1 $acos "acos($1)"
.OP 1 $atan "atan($1)"
.OP 2 $atan2 "atan($2 / $1)"
.OP 1 $sinh "sinh($1)"
.OP 1 $cosh "cosh($1)"
.OP 1 $tanh "tanh($1)"
.OP 1 $asinh "asinh($1)"
.OP 1 $acosh "acosh($1)"
.OP 1 $atanh "atanh($1)"
.OP 1 $j0 "Bessel j0($1)"
.OP 1 $j1 "Bessel j1($1)"
.OP 2 $jn "Bessel jn($1, $2)"
.OP 1 $y0 "Bessel y0($1)"
.OP 1 $y1 "Bessel y1($1)"
.OP 2 $yn "Bessel yn($1, $2)"
.NH 2
Boolean Operators 
.LP
.OP 2 <  "$1 < $2"
.OP 2 <= "$1 <= $2"
.OP 2 == "$1 == $2"
.OP 2 != "$1 != $2"
.OP 2 >  "$1 > $2"
.OP 2 >= "$1 >= $2"
.OP 2 && "$1 and $2 are both true"
.OP 2 || "at least one of $1 and $2 is true"
.OP 1 !  "$1 is false"
.NH 2
String Operators
.LP
.OP 1 $quote "put necessary quotes and escapes in $1 to quote it"
.OP 2 $cmp "string compare of $1 and $2 (returns -1, 0, or 1)"
.OP 1 $length "string length of $1"
.OP 2 $concat "concatenate $1 and $2"
.OP 3 $concat3 "concatenate $1, $3, and $2"
.NH 2
Special Operators
.LP
.OP 1 $time "convert $1 to printable string (a la ctime)"
.OP 2 # "retrieve attribute $2 in record $1"
.OP 3 $get "retrieve value and seq # of $2 in $1 at seq $3"
.OP 0 $db_ddcount "return #records in database"
.OP 1 $db_datadict "return $1-th record"
.OP 1 $db_count "return # attributes in $1"
.OP 2 $db_attr "record = $1.  Return $2-th attr"
.NH 2
General Operators
.LP
.OP 0 $nop "do nothing"
.OP 1 $pop "discard $1"
.OP 3 ? "if $1 then $2 else $3"
.OP 1 $dup "replace with $1 and $1 again"
.OP 2 $swap "replace with $2 and $1"
.OP 1 $eval "evaluate $1"
.OP 2 $iter "evaluate $2 $1 times"
.OP 2 $list "evaluate $2 on all entries in record $1"
.OP 3 $sort "return $2th attribute in list $1 if compare function is $3"
.OP 3 $select "return $2th attribute in list $1 for which $3 is non-zero"
.OP 2 $selcnt "return # attributes in list $1 for which $2 is non-zero"
.NH 2
Constants and Variables
.LP
.OP 0 $null "push a null value"
.OP 0 $e "exp(1)"
.OP 0 $pi "3.14159 ..."
.OP 2 := "assign $1 to $2"
.OP 0 $db_seq "current sequence number"
.OP 0 $record "the record that contains the function"
.OP 0 $attr "the attribute that contains the function"
.OP 0 $entry "the entry name in $list and $select operations"
.OP 0 %1,%2 "the entries that are being compared in $sort"
