\section{Sites}
The term ``site'' refers to a machine that has ISIS running on it.
\Marginpar{{\em Site and incarnation numbers, site-id's}}
Sites are organized into clusters\index{cluster of sites}.
Although each
cluster can have thousands of ISIS sites, at most
254 can actually run ISIS, with the remainder grouped around these
as ``remote clients''.
More commonly the number of ISIS sites is kept fairly small, with
few applications running ISIS directly on more than 100 sites,
and other machines hooking into ISIS through the remote access
feature.
Communication {\em between} clusters requires that your application pass
through the long-haul communication facilities, and is not transparent.
The sites that actually run ISIS
in a cluster are assigned a unique positive number,
between $1$ and $254$
through the ISIS ``sites'' configuration file (see Appendix A).
This number is called the {\em site number}.\index{{\tt site\_id}, site numbers}
The sites that function as remote ISIS
clients do {\em not} receive site numbers; processes on them are always
listed as being on site 255.
ISIS also gives each site an incarnation number, which is initially $0$.
It is incremented by $1$ each time a site crashes and then comes back up.
The site-id of a site contains its site number and its incarnation number.
The type {\tt site\_id} \index{{\tt site\_id}, type definition} is defined as follows.
\begin{verbatim}
typedef short site_id;
\end{verbatim}
Since site-number 0 is not used,
lists of site-id's are terminated by a null site-id.
Site-id's are accessed using these macros.\index{{\tt site\_id}, macros for manipulating}
\begin{verbatim}
MAKE_SITE_ID (s_no, incarn)  /* Return a site-id containing    */
                             /* the given site number and      */
                             /* incarnation number             */
SITE_NO (s_id)               /* The site number contained in   */
                             /* the given site-id              */
SITE_INCARN (s_id)           /* The incarnation number         */
                             /* contained in the given site-id */
\end{verbatim}
The following global variables are made available to ISIS applications.
\index{\tt my\_site\_no}\index{\tt my\_site\_incarn}\index{\tt my\_site\_id}\index{\tt my\_host}
\begin{verbatim}
int      my_site_no;        /* Site number of this site        */
int      my_site_incarn;    /* Incarnation number of this site */
site_id  my_site_id;        /* Site-id of this site            */
char     my_host[64];       /* Full host name of this site     */
char     site_names[n][];   /* Machine name of site-no n       */
\end{verbatim}
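As a brief illustration, the following fragment (a sketch; it assumes ISIS has
already been initialized in the usual way) combines the macros and globals above
to report the identity of the local site:
\begin{verbatim}
site_id me;

/* Build a site-id for this site and take it apart again */
me = MAKE_SITE_ID (my_site_no, my_site_incarn);
printf ("This is site %d/%d (%s)\n",
        SITE_NO (me), SITE_INCARN (me), my_host);

/* Since site_id is a simple integer type, it can be compared directly */
if (me == my_site_id)
        printf ("...which matches my_site_id, as expected\n");
\end{verbatim}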
The ISIS system keeps track of which sites are currently active in the
cluster and stores this information in the site view structure at every
site.
This structure is updated whenever a new site joins the cluster, an
existing site is deemed to have crashed, or a site that was down comes back
up.
ISIS runs a protocol that ensures that every operational site in a cluster
sees the same set of joins, crashes or recoveries, {\em and in the same
order}.
Crashes are detected by the system using timeouts.
This means that if a site does not respond for a long time, it might be
considered to have crashed even when it is just overloaded.
If this happens, the system shuts off all communication to and from this
site and will force it to temporarily halt and come back up again with a
new incarnation number (when this happens, a message is printed in the log
of the affected site: {\tt I am dead. Site xxx/xxx told me so}).\index{\em I am dead. Site xxx/xxx told me so.}
This is necessary if we want to ensure that all joins, crashes and
recoveries are observed in the same order everywhere.
The timeout interval is large, and is adaptively changed if the measured
response time for a site seems high.
The rule used is such that the probability of an operational site
being forced to undergo recovery is rather small.
The site view structure can be examined by calling {\tt site\_getview}:\index{\tt site\_getview}
\begin{func}{Get information about the site view}
\begin{verbatim}
sview *
site_getview()
\end{verbatim}
\end{func}
{\tt site\_getview} returns a pointer to the site view structure, which is
defined as follows.\index{{\tt sview} type definition}\index{site views, structure defined}
\begin{verbatim}
typedef struct
{
    int      sv_viewid;
    site_id  sv_slist[MAX_SITES + 1];
    u_char   sv_incarn[MAX_SITES + 1];
    bitvec   sv_failed;
    bitvec   sv_recovered;
} sview;
\end{verbatim}
\begin{description}
\item[{\tt sv\_viewid}] is initialized to $0$ and is incremented each time the
site view is changed.
\item[{\tt sv\_slist}] is a list of the site-id's of all operational sites
(terminated by $0$).
\item[{\tt sv\_incarn[i]}] gives the incarnation number of the site with site
number $i$.
If the site is down, its incarnation number is the defined constant
{\tt DOWN\_INCARN}.
\item[{\tt sv\_failed}] and {\tt sv\_recovered} are bit vectors describing
which sites failed or recovered between the last site view and the present
one.
In {\tt sv\_recovered} we also include sites that join the system for the
first time.
See Appendix B for a description of the {\tt bitvec} type and the routines
used to access the bits. For manipulation of the
failed and recovered bit vectors, the most important of these is {\tt bis(bvec,bno)},
which tests to see if bit {\tt bno} is set in bitvec {\tt bvec}.
\end{description}
The macro {\tt SITE\_IS\_UP (s\_no, incarn)} returns {\tt TRUE} if
site number {\tt s\_no} is up and has incarnation number {\tt incarn}.
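As an example, the fragment below (a sketch; error handling is omitted) prints
the sites listed in the current view. The failed and recovered bit vectors can
be scanned in a similar loop using the {\tt bitvec} routines of Appendix B.
\begin{verbatim}
sview   *sv = site_getview ();
site_id *sp;

printf ("site view %d:\n", sv->sv_viewid);
for (sp = sv->sv_slist; *sp != 0; sp++)       /* list is null-terminated */
        printf ("  site %d/%d is up\n", SITE_NO (*sp), SITE_INCARN (*sp));

if (SITE_IS_UP (my_site_no, my_site_incarn))  /* sanity check on ourselves */
        printf ("  (the local site is in the view)\n");
\end{verbatim}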
\section{Addresses}
An address
\index{addresses}\index{process, {\em address} of}\index{process groups, {\em address} of}
in ISIS is a structure used to identify a process or process
group.
It is most often used in sending messages to a process or a group.
ISIS normally deals with addresses in terms of {\em pointers} to an
address structure stored someplace in memory.
\index{addresses, permanence}
There are three cases of address storage.
The first applies to addresses that are in messages, for example
\marginpar{\em Permanence of addresses.}
the address of a message's sender (obtained using {\tt msg\_getsender()}).
When ISIS returns pointers this way, they point
into the message and remain valid as long as the message has not been
deallocated.
However, if the message goes away, the addresses in it do too.
The second case consists of addresses obtained from ISIS, such
as the address of a group to which a process belongs or of which it is a
client, or one that it has looked up using {\tt pg\_lookup()}.
In these cases, the address is actually stored in a permanent data structure.
As long as the process remains active, the address will not be
deallocated, even if the group it refers to goes away.
The storage overhead of this approach is minor, since addresses are small.
A final case arises when a process stores an address by value; this
is uncommon in the current version of ISIS since the pointers obtained
from the system generally remain safe indefinitely.
ISIS has a few global variables containing addresses.
The global variable {\tt my\_address} contains the address by which ISIS
knows the current user process.
Another variable, {\tt NULLADDRESS}, contains a null address, which is often
used to terminate lists of addresses.
\index{\tt my\_address}
\index{addresses, \tt my\_address}
\index{\tt NULLADDRESS}
\index{addresses, \tt NULLADDRESS}
\begin{verbatim}
address my_address; /* Address of this process */
address NULLADDRESS; /* A null address */
\end{verbatim}
The following functions are used to deal with addresses; most are actually
defined as macros for speed.
\begin{func}{Compare two addresses}
\begin{verbatim}
int
addr_cmp (addr_1, addr_2)
address *addr_1, *addr_2;
\end{verbatim}
\end{func}\index{addresses, \tt addr\_cmp}
\vspace{1ex}
This function may be used to sort addresses.
It returns $0$ if the addresses are ``equal'',
a negative integer if {\tt addr\_1} is ``less
than'' {\tt addr\_2}, and positive if {\tt addr\_1} is ``greater
than'' {\tt addr\_2}.
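For example, an array of addresses can be sorted with the standard {\tt qsort}
routine through a small wrapper (a sketch; the array {\tt alist} and its length
{\tt n} are assumed to have been set up elsewhere):
\begin{verbatim}
static int
addr_wrapper (a, b)
    char *a, *b;
{
        return addr_cmp ((address *) a, (address *) b);
}
        ...
qsort ((char *) alist, n, sizeof (address), addr_wrapper);
\end{verbatim}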
\begin{func}{Test addresses for equality}
\begin{verbatim}
int
addr_isequal (addr_1, addr_2)
address *addr_1, *addr_2;
\end{verbatim}
\end{func}\index{addresses, \tt addr\_isequal}
\vspace{1ex}
This function returns {\tt TRUE} if the addresses are ``equal''
and {\tt FALSE} otherwise; the rule used to decide equality is discussed
at the end of this section.
\begin{func}{Is this address null?}
\begin{verbatim}
int
addr_isnull (addr)
address *addr;
\end{verbatim}
\end{func}\index{addresses, \tt addr\_isnull}
\vspace{1ex}
This function returns {\tt TRUE} if the {\tt addr} is null
and {\tt FALSE} otherwise.
\begin{func}{Is this address mine?}
\begin{verbatim}
int
addr_ismine (addr)
address *addr;
\end{verbatim}
\end{func}
\vspace{1ex}
{\tt addr\_ismine} returns {\tt TRUE} if the {\tt addr} equals {\tt
my\_address} and {\tt FALSE} otherwise.\index{addresses, \tt addr\_ismine}
\begin{func}{Print a single address}
\begin{verbatim}
int
paddr (addr)
address *addr;
\end{verbatim}
\end{func}\index{\tt paddr}\index{print an address}\index{addresses, printing (\tt paddr)}
\vspace{1ex}
Prints an address using the format {\tt (site/incarn:pid.entry)}.
If a text string is associated with the entry (via {\tt isis\_entry}),
that string is printed; otherwise the entry number itself is printed.
The address is output to standard output with no spaces before or after it.
\begin{func}{Print address list}
\begin{verbatim}
int
paddrs (alist)
address *alist;
\end{verbatim}
\end{func}
\vspace{1ex}
Prints a list of addresses by repeated calls to {\tt paddr}.
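For instance, a null-terminated list of addresses (such as one obtained from a
group view) can also be walked by hand using {\tt addr\_isnull} and {\tt paddr};
this sketch prints one address per line:
\begin{verbatim}
address *ap;

for (ap = alist; !addr_isnull (ap); ap++)
{
        paddr (ap);          /* print this address */
        printf ("\n");
}
\end{verbatim}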
It is not recommended that processes access the contents of the
address structure (defined in the ISIS source file mlib/msg.h) directly.
ISIS actually supports several classes of addresses, including
process addresses, group addresses, and process list addresses, and
the structure itself is maintained as a fairly messy union.
\index{addresses, structure defined}
\Marginpar{{\em {\tt address} structure}}
If absolutely necessary, the following are the major fields within
the structure:
\begin{verbatim}
struct address
{
    short   addr_process;   /* UNIX process id */
    short   addr_portno;    /* UDP port for direct communication */
    u_char  addr_site;      /* Site number */
    u_char  addr_incarn;    /* Site incarnation */
    u_char  addr_entry;     /* Entry point to invoke */
    u_char  addr_pad;       /* Reserved */
};
\end{verbatim}
The {\tt addr\_entry} field is filled in with the entry number when this address
is being used to send a message and is $0$ otherwise.\index{{\em entry} number, defined}
The {\em type} of address is encoded in the {\tt addr\_portno} field,
which has a negative value for non-process addresses and is
non-negative for processes.
For two addresses to be considered equal,
they must have the same values in\index{{\em incarn} number}
all fields.
This is in contrast with releases of ISIS prior to version 2.0, in which
entry number $0$ was treated as a wildcard (recall that
entry numbers normally start with 1).
A special version of the address comparison routine, called
\marginpar{{\em The address comparison rule changed
between ISIS V1.3 and V2.0!}}
{\tt paddr\_isequal}, implements a version of this older comparison rule.
This routine returns true if its arguments are equal with the
exception of the {\tt addr\_entry} fields, which it ignores.
This change was made when we found that the wildcard rule created
a great deal of confusion and was rarely valuable outside the ISIS system
itself.
The preceding chapter focused on ISIS at the ``microscopic'' level of a
single process group. This is
the interface through which most ISIS programming is done.
Yet, you may be left wondering about the big picture:
\begin{itemize}
\item
What are the best ways to structure an ISIS application?
\item
How can one anticipate issues of performance and scale?
\item
Under what conditions might ISIS {\em not} be suitable?
How can one mix ISIS and non-ISIS mechanisms?
\item
How should issues like security and overall robustness be addressed?
\item
What high level utilities and applications are available to users of the
system, and how can these be employed most effectively?
\end{itemize}
Our goal in this chapter is to address the first of these questions, namely
the architecture of typical ISIS-based applications.
The next chapter will provide a tour through the major
layers of the toolkit and the existing application facilities.
Jointly, these should help the reader understand
how to approach a large application in which ISIS is to be used---and
how to approach the remainder of this manual.
Although architecture has important performance implications,
the aspects addressed in this chapter and the one that follows are
primarily large-scale issues.
We will defer our discussion of performance issues as they arise
in individual process groups to Chapter \ref{Ch:perf}, after
the ISIS interface has been covered in greater detail.
\section{ISIS in the large}
Suppose that you are looking at a truly large distributed system,
and have no idea at all where ISIS can be used most effectively.
From our papers and the first chapter of this manual, you have
concluded that ISIS might be useful in solving some aspects of
your application. Yet, it clearly isn't appropriate in others, and
performance and scale questions loom as major issues just
down the road.
To decide if ISIS will be useful in your system, it is important to
start with a general idea of what kinds of things ISIS is good
at, and what constraints need to be kept in mind as you apply ISIS
in a larger system.
\section{Use ISIS for things it is good at}
We can come at this material from several angles.
One perspective relates to the basic functionality of the system.
ISIS is essentially a technology for {\em distributed control}.
That is, the facilities ISIS provides are largely oriented towards
controlling some underlying activity in a networked setting.
One tends to look at ISIS as a tool for monitoring a
system or subsystem, reacting to events that affect it, reconfiguring it,
and in general sensing conditions and coordinating
a response in a fault-tolerant manner.
Of course, control isn't all that ISIS is useful for.
In some situations the basic ISIS facilities are, by themselves, an
attractive programming model. For example, a system for trading stocks and
bonds has a significant need to distribute quotes as they are read off the
wire (ticker), and ISIS multicasts seem like a natural way to solve
this problem---if they are fast enough. (As we will see later in this
manual, they definitely {\em can be} fast enough,
but you may need some degree of sophistication to be sure that they {\em will be} fast
enough!)
Likewise, if your goal is to distribute some simple computation over a set
of servers, ISIS might represent a nice set of facilities for doing this,
e.g.~to exploit the parallelism of a multicomputer environment.
In other settings, though, much simpler technologies like UNIX pipes or streams
or remote procedure calls might make sense. In fact, to the extent that
there are standards in distributed computing, these tend to mandate
RPC interfaces (remote procedure calls) and even certain styles of RPC
data encoding.
For applications like these, one is forced
to divide the problem into two parts: those aspects that ISIS can
help with, and those for which the simpler or more standard facility suffices.
{\bf Example 1.} {\em The ISIS group has built two file services that support
file replication and dynamic reconfiguration. These use the standard NFS
protocol to talk to clients and to the file storage units---through a
conventional SUN-RPC interface. ISIS is used to mediate interactions
within the set of file servers, and to maintain replicated state information
concerning the overall status and configuration of the system.
This approach pays off because NFS is quite standard and many OS kernels
are optimized to issue NFS RPC calls very efficiently. In those parts
of the problem that would be hard using conventional programming technologies,
though, ISIS proves to be a big win.
In effect, ISIS is used to control a distributed service that users
talk to through a conventional interface. }
We consider this approach quite reasonable. It is important to
{\em avoid} trying to apply a system like ISIS in every aspect of
an application, unless the match is uniformly good.
ISIS makes some aspects of a problem simpler, and should be used in those
situations, but it might not be necessary---or convenient---to use ISIS all
through your application.
In our own work, for example, we have built many services that use ISIS to
maintain their internal state, but offer a conventional RPC interface to the
world. On the other hand, we don't always do this, and it has problems if
one wants the RPC to look transparently fault-tolerant. The lesson, then, is
that sometimes ISIS offers a natural solution to a problem, but sometimes
ISIS is awkward, fails to conform to standards, or whatever, and must be
made to co-exist with other mechanisms. Our system is designed to
co-exist and if this is needed in your application, it need not be much of
an obstacle.
\section{Hierarchical design}
A second angle on the architecture problem relates to the overall structuring
of your application.
As we saw in Chapter 1, ISIS introduces a new construct called the process group.
We also saw that process groups support fairly closely coupled interaction
between processes. In
many cases, all the processes involved receive and react to every event of interest
to the group. This may make sense for groups containing two or even three processes,
but makes no sense at all in very large scale settings, where a group might
contain hundreds of processes or span an entire enterprise.
Such a group will clearly need a hierarchical structure.
Moreover, if you are faced with a truly large environment,
a hierarchical structure is unavoidable and ISIS itself imposes such a
solution.
One can easily imagine a local area network
composed of hundreds of machines. One wouldn't want ISIS to span this
whole collection of machines, although it might make sense to run lots of
smaller ISIS installations on subsets of machines and to interconnect these.
Even if one did want to use ISIS that way, the system wouldn't work well.
The following example illustrates this.
{\bf Example 2.} {\em A typical stock brokerage
system consists of a small number of servers
that provide a computational core of the system, and a much larger
number of clients strung around the servers.
Running ISIS directly on the client workstations has many
drawbacks: it would consume resources, the workstations may be
much slower than the servers, and the annoyance of maintaining an
ISIS configuration on them would be substantial. Moreover, the
use a client makes of ISIS is minimal, mostly receiving quotes.
When we first encountered this problem, we thought about trying to
get ISIS running on hundreds of machines simultaneously, but this
is actually not what was needed to solve it. }
In this situation,
an interface that lets large numbers of clients
connect to ISIS on a small number of host machines may be
more appropriate.
The necessary mechanisms are described in Chapter 4
of this manual under a discussion of the {\tt isis\_remote} system call and
the {\tt isis\_probe()} system call.
The result is a relatively
transparent way for client software to use ISIS without paying the
overhead of actually running ISIS on machines where it would perform
poorly. Even with this approach, one may need to broadcast data (quotes)
to hundreds of workstations using protocols based on point-to-point
messages. (Actually, ISIS can take advantage of hardware multicast if the
hardware supports this, but many networks don't.) This observation
implies that quote dissemination may need to run in two modes: a direct
mode if the number of workstations receiving the quote is fairly small,
and a hierarchical one for large groups, in which each member of a
first tier of senders forwards the message to a small subgroup of
destinations. ISIS doesn't yet make this sort of structure transparent, but it
will in future releases.
Such an architecture has another big advantage: the remote workstations
don't need to be declared to ISIS ahead of time, and can be much slower than the
central ones on which the overall ISIS system is actually running.
Yet, the application software gets to use the full range of ISIS facilities, and sees
essentially the same performance as if it ran directly on a machine where ISIS is
available. In the limit, one could imagine a central pool of 20 or 30 ISIS servers
supporting hundreds of remote clients, and providing a whole range
of highly reliable services. Since many remote clients will be diskless, if client X
connects to ISIS on the machine where its disk resides, such an approach will always be
more fault-tolerant than if ISIS were not around, and one need not worry about
the client being up and ISIS being down: if the disk is up, ISIS will be running too.
Whereas a system with ISIS itself running on hundreds of heterogeneous processors
might worry the stock brokerage, the idea of running ISIS on a small subset of
fast machines and hooking lots of clients onto the fringes seems much less
daunting. Indeed, many current ISIS users are adopting architectures of this sort.
The point that this example illustrates is that
when you approach the design of a system it will be important
to start thinking about scale early. The functional decomposition
adopted within your application may be strongly influenced by this long-term
issue, even if the strategy you adopt for actually implementing the
system defers scaling until the relatively remote future.
As we will see, ISIS is moving towards
providing relatively transparent solutions
to a whole range of problems in this area:
subdividing a network into small subnetworks,
subdividing a big process group into little subgroups that each deal with just
some of the operations on the big group. However,
ISIS V2.0 doesn't incorporate all of
these mechanisms yet.
We do know how they will look and in what time frame
they should be available, and this should help you plan ahead.
In the meantime, you can easily implement these sorts of structures
yourself even without tools to completely hide their presence.
One aspect that ISIS is only of limited help with is heterogeneity.
Basically, ISIS is a big-machine technology. It runs well on UNIX
workstations and mainframes, whereas PC's are essentially ignored in
V2.0 (our current plan is to support access to PC's through the remote
client interface mentioned above, but even this is tricky).
This may militate {\em against} using ISIS in some applications. Certainly,
it is an issue to consider early in the design process.
We have been discussing scale in a way that implicitly assumes a single
local area network, although perhaps spread over a large area and
containing many machines.
The same issue can be carried one level further.
Perhaps you will need to interconnect LAN's in a loosely
coupled manner. Such a system has a clear separation
between ``local'' and remote;
remote lines are more costly and hence available only intermittently; they fail
more often, may have low bandwidth, and almost always have big latencies.
The presence of these communication links cannot (or, at least, should
not) be hidden from the
application. ISIS provides tools oriented towards building software that
spans such ``long-haul'' links, and one of the problems we'll be examining below is
how and when to make use of these facilities. They are not, however,
transparent in the same sense as the hierarchical facilities that will be
available in the near future.
Thus, a large ISIS application needs to be hierarchical, quite possibly at multiple
levels. One needs to design its long-haul structure, its large-scale local structure,
and its small-scale local structure.
Seen in this perspective, the example of Chapter 1 concentrated on the
ways that ISIS can help in solving a small-scale structuring problem.
Our goal in this chapter will be to review the conceptual tools and
software utilities that ISIS provides in support of this sort of layering.
\section{Process groups as a modularity construct}
We have identified a need to separate aspects of an application for which ISIS
can be helpful from those that might be better solved using some other
technology, and identified a need to design hierarchically to obtain software
that scales well. One must also look for clean functional
decompositions corresponding to separable services---ISIS process groups.
There are many reasons for doing this. Let us review some of the major ones
here.
\begin{enumerate}
\item
{\em Divide and conquer.}
A valuable principle in designing any large software system is to divide it
into more manageable pieces.
This forces one to specify the various internal interfaces clearly, which is
an invaluable aid to getting the pieces to actually perform correctly. It also
lets different people program the different parts of an application.
For example, in building a compiler, one tends to separate out the
symbol table manager, implementing this as a module with a simple,
well-defined interface.
Yet, in distributed settings, there has been a tendency to view systems
as monolithic entities.
In ISIS, process groups are cheap, hence the principle of modularity
extends to distributed systems composed of multiple subprograms.
There are several ways that this can be exploited.
\begin{enumerate}
\item
One might build services
out of identical processes; the service as a whole is then a module
in the overall system.
\item
One might build processes that make use of library routines
structured as modules, so that a process
needing services A, B and C actually includes code on behalf of each
of these modules. If A, B and C are distributed, the resulting process
may belong to three process groups. Unlike the first example, though,
the members of these process groups {\em might not be identical}.
Although they share a mutual interest in whatever it is that A, B and
C do, perhaps one process uses A but not B and C, while another
uses all three. Perhaps one is coded in LISP, another in C, and a
third in FORTRAN. The processes are all using the same abstract
module, but their purposes differ.
ISIS supports this type of structuring as well.
\item
While our comments may seem to suggest that the members of
a given group are all identical, ISIS doesn't require this either.
Thus, perhaps group A consists of a set of expert systems and a set
of blackboard managers, with the experts reading and writing the
blackboard and the blackboards managing persistent data.
\end{enumerate}
To summarize,
ISIS really doesn't care why a set of processes chose to
enter into association or what they wish to do.
It merely offers process groups as a general structuring facility,
which can be used in a variety of different ways.
ISIS is especially powerful at helping you specify and implement
distributed correctness properties that span all the
members of a group---properties that can then be assumed
by callers of the modules or subsystems in question.
\item
{\em Divide for robustness.}
Subdivision of a large distributed system into modules can be a valuable way
to isolate critical components and to make them tolerant of failures.
We won't dwell on this issue here, but it should be clear that
a system in which functionality is isolated in a specific component, which is
carefully programmed to be as robust as necessary, is likely to
behave in a satisfactory manner.
This may not be true in a system where critical functions
are spread throughout large numbers of modules.
For example, the mapping of
logical names to physical values is common in almost any distributed system, and
may arise in many ways. If your application has such a requirement, there
are good reasons to implement a general directory service
providing this functionality, instantiated
once for each type of directory entry and value needed.
Once this service is implemented, it shouldn't be necessary to revisit it
again and again as new needs for directory objects are identified.
Concentrating functionality in a single service also has disadvantages.
It forces one
to anticipate the sorts of hierarchy issues cited earlier. For example,
say that a
large system spanning 30 machines
uses a single directory facility. It is certainly
attractive to factor this mechanism out into a service. But, the resulting
service may now be needed by nearly every process in
the system, and perhaps for different purposes (i.e. different
directories). To give adequate performance, such a service would
certainly need to be designed hierarchically: if all entries are replicated
to all 30 machines, the amount of data on each machine might get very
large and performance would be poor. Thus, while one wants a
generic directory service that is accessible everywhere, it should
internally be
composed of multiple subdirectories, each managed by a small number of
processes located near where accesses to the service originate.
This yields fault-tolerance and has the advantage of letting you
concentrate more energy on building a good directory manager. But, it
also means that right from the outset, a fairly sophisticated
directory manager design is needed.
\item
{\em Process groups as a coordination technique.}
One of the more novel aspects of ISIS is that the members of
a process group can cooperate {\em indirectly}.
Say that a system is being implemented to run a factory, and
some critical piece of hardware goes down.
Several different programs need to respond promptly, switching to
backup units, rerouting work, rescheduling tasks, and so forth.
In ISIS, sets of processes can often cooperate by joining a group
just to watch for classes of events that all members are interested in.
The processes involved all see the same events in the same order. Often,
this enables you to design algorithms in which processes react immediately
when they see some event and yet know that they are doing ``part'' of
a global task in which other group members are doing complementary tasks.
This is a unique and perhaps unprecedented style of computing, and
it takes some getting used to.
A simple example may help illustrate this idea.
Say that a big database is to be searched for entries matching some
criteria. In ISIS, you can design a database search group
in which, if there are {\em n} processes in the group, each one
searches $1/n$'th of the database.
The processes don't need to coordinate their actions explicitly:
by watching the group membership change, they can each
respond to an incoming request {\em in parallel}, and yet be sure
that each is behaving in a mutually consistent manner.
ISIS makes it possible to use innovative techniques like this in
solving your application.
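A minimal sketch of the idea follows. It assumes that each member keeps two
variables, {\tt nmembers} and {\tt my\_rank}, up to date by watching group
membership changes (the routines for doing so appear later in the manual), and
that the database routines named here are supplied by the application:
\begin{verbatim}
extern int nmembers;     /* current number of group members (maintained   */
extern int my_rank;      /* this member's rank in the view (maintained    */
                         /* by the application as the view changes)       */
extern int db_nrecords;  /* size of the database (application-supplied)   */

query_handler (mp)
    register message *mp;
{
        int lo, hi;

        /* Each member searches only its own 1/n'th of the database */
        lo = (my_rank * db_nrecords) / nmembers;
        hi = ((my_rank + 1) * db_nrecords) / nmembers;
        db_search_range (lo, hi, mp);                 /* hypothetical search   */
        send_partial_answer (mp, my_rank, nmembers);  /* hypothetical reply,   */
                                                      /* tagged "k of n"       */
}
\end{verbatim}
Because every member sees membership changes and incoming queries in the same
order, the members partition the database consistently without exchanging any
extra messages.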
You may wonder how this kind of server can be made fault-tolerant in ISIS.
Dynamic events like failures and recoveries, or updates
that change the size of the database while a query is being
executed raise questions about the whole idea of subdividing the work in the
manner proposed above.
In fact, such issues can get quite complex, although there are simple
solutions in many settings.
An example of a simple solution would arise in a system where each
server processes one request at a time, to completion. In this case,
a failure might result in a partial answer (say, 3 out of 4 answers),
and we need to encode the answers in a way that lets the process
that issued the query detect such a condition (``answer 2 of 4 is....'').
If searches take a long time and the processes permit some degree of
concurrency, accepting new queries and updates while executing old ones,
such a simple solution won't work, but other schemes can often be
devised to solve the problem. You'll find examples of code to do this
sort of thing in the rest of the manual.
The important task you face as the
architect of a big system is to decide in advance what behavior it
needs to support, and to subdivide the problems until they are specific
enough to be solvable with tools like the ones ISIS offers.
\end{enumerate}
\section{Building robust software}
Fault-tolerant hardware systems often come with a guarantee that
any program you run will execute without crashing due to hardware failures.
Many people turn to ISIS to build fault-tolerant software, and it is
natural to expect that the system will transparently render their
programs fault-tolerant, with little need for analysis or special
programming techniques. ISIS offers a way to build services as fault-tolerant as you
might wish, but the approach is not transparent. In most situations,
you will need to anticipate the nature of the failures your program
actually needs to tolerate and program a solution using the mechanisms
that ISIS provides.
Undertake a risk analysis.
To develop a
long-running, self-controlled distributed service, start by
identifying the {\em threats} to robustness in the intended environment.
Make a picture of your network, at a software and hardware level.
Assume that various single components fail.
How do you wish your application to respond?
How can the components of the application itself fail, and how can the
input to the application be faulty?
ISIS often provides the means to overcome these sorts of problems, but not
unless they are identified at the outset.
Moreover, ISIS can't do very much to help if all the processes that
provide some critical service are dependent on the same UNIX file server or
YP server. ISIS can only help if failures are as independent as possible and
if failure modes are as predictable as possible before you even start to code.
\section{An extended example}
This section presents a
more elaborate example that illustrates the interplay between
these factors in a typical distributed application.
One early ISIS user was a group involved in
building distributed software to verify compliance with nuclear test
ban agreements and to support research into large-scale seismology.
The system in question,
called NMRD, was structured as a set of programs written in
LISP, FORTRAN, C, and other languages.
NMRD actually extends an earlier
seismic monitoring system called IAS; both systems were developed by Science
Applications International Corp. under contract to DARPA Seismology.
IAS was designed as a distributed system, running over Berkeley UNIX
ports and sockets.
The various components are spread over a local area network
and communicate with one another using pipes, asynchronous messaging,
and remote procedure call.
SAIC contacted us when they determined that NMRD would need substantially more
sophisticated distributed control algorithms than did IAS.
Building such a control layer under a conventional UNIX system is
difficult, and SAIC found that it was
investing much too much effort in this aspect of the system.
By using ISIS as a framework for building the desired monitor and control
mechanisms, this task would be simplified.
\begin{figure}
\makebox[511pt][l]{
\hspace{-95pt}
\vbox to 307pt{
\vfill
\special{psfile=kb90-005.ps}
}
}
\begin{center}
{\em NMRD system architecture}
\end{center}
\vspace{1in}
\caption{\bf NMRD system architecture.}
\end{figure}
Although the original IAS system ran on a LAN, there were no mechanisms
for interconnecting LAN's. NMRD was to be used at several
locations worldwide, which needed to be
interconnected to exchange seismic data and to support inter-LAN queries.
The architecture is illustrated in Figure 2.1, where we see the
Norway instance of the system. Among the major components shown are
the SIGPRO programs, which acquire seismic data from sensors;
SIGPRO runs a signal detector to determine
the interesting segments of data, thus reducing the
amount of data that must be further analyzed by ASSESS and other
``downstream'' processes.
In addition, the ASSESS module is informed by SIGPRO of detections and
looks for correlations, a DB module maintains databases of
seismic and geologic data, and a DISPLAY subsystem displays the
data and interacts with an operator. A large number of other programs
come into play only as needed, i.e., on operator command.
The basic control problem that NMRD addresses is easily understood as
a state machine. A given data stream (a set of detections from one
or more input channels) is successively transformed by application of
a series of filters, yielding higher and higher level interpretations of
the seismic events that generated the signal. NMRD is thus concerned with
a distributed control problem: overseeing the ``routing'' of data
through the network of processes that make up the underlying analysis
system. As we will see, this control aspect of the system is
one in which ISIS can be extremely helpful.
It is not hard to make a first cut at distributing this application
over a network. Fast numerical workstations form the backbone of the system; these
are used to run the major components. Database machines provide the DB functionality,
and high performance display stations are used for input and output.
Most programs were originally designed to interact through files; it is
easy to extend them to interact by message passing.
One sees that it would be a simple matter to arrive at a working
distributed NMRD system; in fact, SAIC had completed such a system
well before they became interested in ISIS.
Problems creep in when one starts to think about automating the management of
the system, especially its response to failures.
As outlined above, NMRD would be an extremely fragile piece of software.
The challenging problem is to automate the control of the overall
system---and
to interconnect the various local versions of the system into a single
global system. This is where ISIS comes into play.
Following our own advice, we'll begin by focusing on the NMRD system in a single
LAN, using a tightly coupled design that assumes high speed communication, then
extend it to deal with long-haul links.
Within each LAN, our goal will be to support a highly fault-tolerant
{\em distributed application management} program which monitors activity
within the LAN, displays current system status in an intelligible
(graphical) manner,
supports some degree of scheduling (perhaps, controlling the order in which
component programs perform requests), and restarting failed components in
as transparent a manner as possible.
Clearly, we need to make some changes to the existing software to get it to
cooperate with this controlling facility, but on the other hand,
those old
FORTRAN and LISP programs can get pretty ``dusty''.
We'll need to minimize the changes made to them---ideally, we want to leave
all the old code pretty much unchanged, introducing a small ISIS-based
interface that knows about the networking environment but mimics
the old batch environment as faithfully as possible.
With this idea in mind, the next step is a risk analysis of the overall
environment. This being a UNIX system, the problems are well known ones
and can be circumvented by taking appropriate care.
The major types of failure anticipated are crashes of individual machines and,
especially, of individual file servers.
We'll want to minimize dependencies on any single machine or any single file
to achieve robustness to these sorts of events.
Less frequent events like Ethernet crashes are rare enough to
be ignored in this case. (Were it
critical to keep the application running even if these sorts of events
occurred, it would be necessary to install redundant communication hardware.)
Critical software
resources like the ``Yellow Pages'' service should be
replicated, and bindings of programs to the yellow pages controlled to
ensure that if any one YP server dies, the impact on the network will be
localized.
When crashes do occur, we'll want to automate the restart of affected components
of the system.
Finally, one of the features we will want to incorporate into the
NMRD distributed management scheme is the ability to run the system in
test/debug mode easily, since it will always be incorporating new
processing methods. With such software, it is inevitable that there
will be program bugs. Since it is impossible to prevent program
crashes, our scheme includes a programming paradigm to allow for easy
damage control---more on this later.
The next step in designing our system is to deal with its basic
communication structure.
That is, how does a particular SIGPRO instance (notice that there are
several of these; in fact, the number might vary dynamically) know which
ASSESS program to talk to? How does ASSESS know how to reply?
And, can we potentially support multiple executions of the {\em entire NMRD system}
in a single LAN, e.g. for development purposes?
How can ASSESS be made fault-tolerant (notice that this requires that
SIGPRO programs be reconnected to a new ASSESS after failure, and that ASSESS
be brought back into a reasonable state after being restarted)?
These questions can be seen to revolve around two basic problems: what
exactly is the ASSESS module in our design, and how does one map from its
logical name (ASSESS) to some sort of physical address we can use to communicate
with it.
In the discussion above, we have used the term ASSESS as if it denoted a single
program, augmented with some sort of fancy IO module that reads messages (perhaps
from ISIS) and turns them into what look like records from files.
To solve this long list of problems, however, we'll need a somewhat fancier
representation of ASSESS.
Say that we pair every set of related programs in the real NMRD system
---e.g., the programs making up the ASSESS process, or the SIGPRO
process---with an agent process implemented using ISIS. The agent
process can monitor the health of the programs within its set,
restarting them as needed. Moreover, it can communicate with the IO
module within an active ASSESS component, for example to tell it what
communication route to use to communicate with a particular ASSESS.
This localizes the problem, since we get to design the IO module and can
send it messages, covertly, updating this routing data.
The ASSESS agent can control the order in which ASSESS takes
actions, and spool copies of messages sent to it that affect its ``state'' and
periodic state checkpoints. Thus, to restart ASSESS after a crash, the agent
could pick a suitable machine on which to run the new instance, reload
the state from the checkpoint, replay messages needed to get it back in sync
with the remainder of the system (telling the IO module to suppress any output
generated in the process), and finally reconnect it to other system components
by informing the appropriate IO modules of the new address.
Damage control due to program bugs is made easier by imposing two
requirements upon NMRD application programs: first, that they be
idempotent, and second that they synchronize database commits with
state checkpoints. As it turns out, neither of these requirements is
difficult to achieve. The original application codes were in most
cases wholly idempotent from the start, and so needed little
modification. Synchronizing the commits with the checkpoints is also
easy simply by sending a checkpoint message to the agent process
immediately after any database commits. The two operations
(committing to the database and checkpointing) cannot be made
completely atomic, but the window of vulnerability is minimized.
Damage control after a program crash amounts to fixing the fractious
program, restarting it, and then feeding in the data upon which it
first crashed. It is not necessary to rerun the data through all the
preceding processing steps, nor is there much danger of the database
or the agent state being inconsistent.
Our design has a few disadvantages, and these need careful study.
One problem is the potential overhead of the architecture; fortunately, in the
actual NMRD application, this proved to be quite low.
A second problem is that the agent processes will be quite pervasive: for every
component of the system, one needs two or three agents (to obtain fault-tolerance).
This raises problems of hierarchy in the overall agent subsystem,
and also forces us to design agent processes in a way that will be tolerant
of failures, but at the same time efficient in its use of communication and computation.
For example, having all agents do each operation in parallel looks like a loser; the
computational cost of running the application management code might get to be
a real problem. Fortunately, as we will see throughout the remainder of
this manual, ISIS has powerful tools oriented towards precisely these sorts of
problems.
With regard to the hierarchy problem, several insights suggest a way to
structure agents so as to limit costs. First, observe that what an agent
actually does is largely the same regardless of the type of component it is
monitoring. This suggests that one might factor out
all component-specific code either into a separate agent process group, or
at least into an isolated module of per-component mechanisms.
The remaining ``generic'' agent code would be gathered into a single
heavyweight process, with one instance of this process run for
each segment of the signal processing pipeline. There would be at most one
instance of this process on any physical machine.
The {\em responsibilities} of each such process will vary depending on the
local need. Thus, if a given machine is actually running a SIGPRO component,
the local generic agent process should presumably include a specific agent
instance on behalf of that SIGPRO. This might be a lightweight task, an entry
in a data structure, or whatever else is needed.
Moreover, the same heavyweight process might function as a {\em backup}
agent for some set of other NMRD system components, perhaps two SIGPRO components
running on other machines in the network.
Ideally, we will want our local agent to handle most ASSESS requests completely
locally---without multicasts to its backup agents.
The backup agents should be idle as much as possible, taking actions only when
informed of some critical event by a local agent process or when a local
agent fails.
This is essentially the structure ISIS calls a coordinator-cohort scheme, and
the toolkit makes it
easy to implement. Notice, however, that our design requires that the programmer
be comfortable with several ideas that will seem hard to understand on first
encounter: the idea that a heavyweight process might contain many
lightweight tasks, the idea of a big process group with little subgroups within it,
each doing a coordinator-cohort computation, the idea of replicated system
state (for example, the information managed by IO modules that tells them who
to communicate with), and so forth.
This is not the sort of application that a new ISIS user could build without
practice. It is, however, a reasonable solution to the problem and one that
makes sense within ISIS. A sophisticated ISIS user will come to see this
sort of design as ``natural''. Such a user will find the system tremendously
valuable.
We won't pursue this problem much further. However, several of the issues raised
in the above discussion are worth considering briefly.
The first relates to naming.
Potentially, one could imagine having several NMRD applications running at once,
each with its own copy of ASSESS, several copies of SIGPRO, and so forth.
We need a way to give these sensible, unique names.
ISIS applications solve this sort of problem using a hierarchical naming
convention (``/nmrd1/sigpro2''), which looks natural enough, but does raise
the question of how a given SIGPRO figures out its name and how the rest of the
system gets told what this name is. We'll tackle this below---the
mechanism is pretty straightforward. You can think of it as
a conventional UNIX file system scheme---in fact, names like ``../alarm''
even work as expected. While the ISIS naming facility won't actually
assign unique names, by making use of a few extra process
groups (and why not; they are very cheap), this problem is easily solved.
For example, SIGPRO processes in ``nmrd1'' could start by joining a
group called ``/nmrd1/sigpros''; within this group it is trivial to
count members (this is a benefit of virtual synchrony), hence SIGPRO {\em 3}
can figure out that it should name itself ``/nmrd1/sigpro3''.
Another issue relates to the fact that the old code in NMRD probably doesn't
respect stack usage limits and hence can't be run inside ISIS lightweight tasks.
This problem is solved by running the old code
on the original system stack (the one UNIX provides and grows automatically),
with only the IO module running on dynamically allocated stacks.
Since this module consists of new code, it should be easy to keep its use
of the stack under a reasonable degree of control.
The approach isn't hard to implement, provided that one realizes that it is needed.
Still another issue involves monitoring.
Our agent processes will need ways to track the load on individual
system components, to pick a good machine to restart ASSESS on if it crashes,
and in general to monitor sensors so as to detect when certain conditions
arise and trigger appropriate reactions.
ISIS solves this through a high level facility called META.
META basically manages a database of sensor values and
interprets high-level queries against the database as it changes in realtime.
META is built on ISIS and co-exists with it nicely, but is quite a complex
subsystem. We describe it later in this manual.
Fortunately, security is {\em not} an issue here, which is good because it wouldn't be
an easy one to address. ISIS isn't very good at security yet, although
we have plans to extend the system in this area in the future.
A final issue relates to the interconnection of NMRD on one LAN with
NMRD on another one.
ISIS provides fairly simple but powerful tools for this purpose.
They let you send messages from a group on one LAN to a group on another,
or entire files (even large ones), with built-in spooling and reliability
mechanisms so that communication can be batched up and performed when the
link between two LAN's is available.
Typical of the facilities that NMRD might want in such a setting is a
scheduling mechanism (one wouldn't want to let Washington overload Norway with
remote signal processing requests, but on the other hand we certainly
want to keep the system busy), distributed data management (e.g. lists of
pending analysis requests and file names where the results should be stored), etc.
We have implemented demonstrations of this sort of facility in ISIS,
and the discussion appears elsewhere (in a paper available from
Cornell). One might thus adapt these prototype solutions to the problems that
arise in a system like NMRD.
Notice how we approached the NMRD problem. First, a general
system structure was proposed. We studied it to understand how we
wanted things to work when the system runs automatically, and
gradually changed these goals into concrete requirements on our
broad system modules. We factored functionality out using process groups,
but postulated a number of groups and used them in varying ways.
Finally, within specific groups, we looked for solutions to specific
subproblems, and at this level found solutions within the ISIS Toolkit.
ISIS provides a powerful
environment for solving the problems that a system like NMRD confronts.
At its core, NMRD
is a control problem, and ISIS offers both an architecture within which the
problem can be tackled and a set of facilities for actually implementing the
desired solution. On the other hand, the solution wasn't trivial.
As it happens, the ISIS Project at Cornell is now
building a general purpose application management tool, and we hope
that much of what a system like
NMRD needs can be addressed using this utility.
However, such a utility
will not eliminate the obligation of the systems architect
to think in terms of distributed application structures, to
anticipate aspects of the system that need to be robust, to find ways to
achieve that robustness efficiently, to worry about modes
of failure and performance issues, and to design for scale.
While ISIS makes some very hard problems tractable,
it certainly won't put anyone out of work. Instead, it encourages
you to take on more ambitious problems and devise more powerful solutions.
We like to think of ISIS as opening the door to a completely new style
of distributed programming.
It would certainly be pleasant if a system like ISIS could make all this
transparent---perhaps, for certain classes of applications, we will someday
be able to do so. In the meanwhile, though, it is our assertion that
ISIS provides an unusually effective technology for solving this class of
problems, and that it encourages a mindset within which one is led to
solutions that work unusually well.
Ad-hoc solutions to difficult problems are rarely
effective and almost never robust.
ISIS and META, for many reasons that enter at many levels of the problem,
offer a far preferable alternative.
Our system has its limitations, and it will be some time before these are
overcome. NMRD, as described here, is clearly not a system composed of
thousands or even hundreds of nodes. If it were, a whole new class of issues
would arise that the present version of ISIS is inadequate to solve.
However, performance and scale are two major foci of our research effort at
present, and we believe that both obstacles can be overcome well before such
large networks become common in the market; certainly, few existing LAN's
involve close couplings between such large numbers of machines.
Not long ago, performance of ISIS was a major issue even
in small settings, and we have largely eliminated this as a problem with
ISIS V2.0. Our progress in solving one problem after another in a challenging
system encourages us to believe that these future challenges, too, can be
confronted and resolved.
\section{A layered technology}
In light of the above discussion, it should not be surprising that we think of
ISIS itself in terms of layers. The next chapter is concerned with
layers: how ISIS is structured, what pieces need to be around when
an ISIS application is running, where the functions of a given
layer are actually implemented, and how
the ISIS layering relates to the way your application should be
structured. This will prove to be a valuable introduction to the
remainder of the manual, which looks at the ISIS toolkit from the
lowest levels up.
Throughout, we will work with small examples, illustrating how one tool
or another should be used.
To make the most effective use of the ISIS system,
you will need to acquire an ISIS mindset.
Reading this chapter, you should have gained some sense of what
that mindset is concerned with.
We believe that real mastery of ISIS
comes from actual use of the system.
Our advice is that the reader experiment, build small chunks of code,
design and debug modules in isolation, and gradually work his or her
way into genuinely large-scale design efforts.
In such an approach, each problem addressed
is understandable in terms of
a technology that the designer has already mastered.
One of the pleasant surprises about virtual synchrony is that the
technical issues that
arise at each level prove to be manageable ones.
Although we haven't used this term anywhere in this chapter,
this unifying, deeper concept pervades the whole of ISIS and
makes it possible to launch into many designs secure in the
belief that you really will be able to solve the problems that
arise once you define them carefully enough. In ISIS, most
distributed systems problems admit a clear definition and a simple
solution.
This chapter covers most of the basic facilities of ISIS.
Most of the applications you might want to build can be constructed from
the set of functions described here.
As you gain experience with ISIS, you may wish to read further to learn
about the more sophisticated tools, or to find out ways of improving the
performance of your application.
For each function below, we give its declaration, a description of
its parameters, and a short note on what the function does.
\section*{Error numbers}
Functions that return integer values return -1 if an error\index{error numbers}
occurs (unless a negative number is a reasonable value for the function to
return).
Functions that return pointers return $0$ on error.
Functions that return structures return the appropriate null structure
(e.g. a function that returns an address returns a null address).
When an error occurs, the global variable {\tt isis\_errno}\index{\tt isis\_errno} usually contains
an error number indicating its type.
This is often also the return value of functions that return
integers.
Constants of the form {\tt IE\_XXXX} are defined in the file {\tt
isis\_errno.h} \index{\tt isis\_errno.h} to refer to error numbers.
The function {\tt isis\_perror(string)}\index{\tt isis\_perror} prints out {\tt
string:} followed by a textual description of the error indicated by {\tt
isis\_errno}.
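As a brief illustration (the group address, entry number and variables here are invented for the example), a caller might check for the -1 return and report the problem with {\tt isis\_perror}:
\begin{verbatim}
    int nresp, answer;

    nresp = bcast(&gaddr, LOOKUP, "%s", name, 1, "%d", &answer);
    if (nresp == -1)
        isis_perror("lookup bcast");
\end{verbatim}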
Any exceptions to these rules will be indicated in the description below.
Our system has its limitations, and it will be some time before these are
overcome.
% --- basic.tex ---
\newenvironment{func}[1]{
\vspace{2\baselineskip}\begin{tt}\begin{samepage}
\par\noindent\Marginpar{\em #1}\begin{minipage}{\linewidth}}{
\end{minipage}\end{samepage}\end{tt}\vspace{1ex}}
\newenvironment{tfunc}[1]{
\vspace{2\baselineskip}\begin{tt}\begin{samepage}
\par\noindent\Marginpar{\em #1}\begin{minipage}{\linewidth}
\begin{tabbing}12\=3456789012\=3456\=\kill}{
\end{tabbing}\end{minipage}\end{samepage}\end{tt}\vspace{1ex}}
\newenvironment{code}{
\vspace{1ex}\noindent\begin{tt}
\begin{tabbing}1234\=5678\=9012\=3456\=7890\=1234\=5678\=9012\=\kill}{
\end{tabbing}\end{tt}\vspace{1ex}}
\newcommand{\defn}[1]{\item[{\tt #1 \hfill}]}
\newenvironment{defs}{
\begin{list}{}{\setlength{\leftmargin}{0.75in}
\setlength{\labelwidth}{0.75in}
\setlength{\topsep}{1ex}
\setlength{\itemsep}{0in}
\setlength{\labelsep}{0in}}}{
\end{list}}
\newcounter{example}[section]
\newcommand{\exam}{\vspace{1ex}\refstepcounter{example}
\paragraph*{Example \theexample}}
\input{base}
\input{address}
\input{message}
\input{bcast}
\input{pgroup}
\input{monitor}
\input{xfer}
\input{startseq}
% --- bcast.tex ---
\section{Broadcasts and replies}
A broadcast \index{broadcasts, introduction} involves sending a message to a process or process group and
\index{messages, broadcast to a process group}
optionally collecting a specified number of reply messages from the
recipients.
If a broadcast is made to a process group, a copy of the message is sent to
each of its members.
Deliveries of broadcast messages are ordered relative to one another.
By this we mean that if any pair of broadcast messages go to overlapping
destinations, they will be delivered at all such destinations in the same
order.
Broadcast message delivery is also ordered relative to other distributed
events like the notification of group membership changes (see below).
As a consequence of this,
if a member of a process group receives a broadcast message addressed to
the group, then a copy of the message will also be sent to all other
members that it knows to be in the group at the time the message is
received.
(It follows from this that all the members will have exactly the same view
of the group membership when any particular broadcast message is received.)
In terms of our virtual synchrony model, this means that the delivery of
any given broadcast message
can be viewed as a single virtually synchronous distributed event.
A process performing a broadcast specifies an entry number in addition to
the address.
This entry number determines which task will be invoked in a recipient
to handle this message.
Each of the recipients must have associated the given entry number with a task
using the {\tt isis\_entry} declaration.\index{\tt isis\_entry}\index{{\em entry} number, declaring}
\index{\tt msg\_rcv}\index{\tt MSG\_ENQUEUE}\index{\tt msg\_ready}
\begin{func}{Declare an entry}
\begin{verbatim}
isis_entry (entry_no, routine, routine_name)
int entry_no;
int (*routine)();
char *routine_name;
\end{verbatim}
\end{func}
\begin{defs}
\defn{entry\_no} Entry number.
\defn{routine} Task to be invoked or the macro {\tt MSG\_ENQUEUE}.
\defn{routine\_name} Null-terminated string containing the name of the
routine {\tt routine}.
\end{defs}
This function declares that when a message addressed to entry number {\tt
entry\_no} arrives at this process, a task is to be started up with the
invocation {\tt routine(msg\_p)}, where {\tt msg\_p} is a pointer to the
incoming message.
\index{messages, received by an entry}
\index{{\em entry} number, entry undefined}
If a message addressed to an undeclared entry number is delivered to a
process, it is simply discarded.
\index{messages, discarded silently due to undefined entry number}
Entry numbers should be in the range $1$ to $100$.
If the function is specified as {\tt MSG\_ENQUEUE}, messages to the
corresponding entry point are queued up and must be individually
received using a call {\tt mp = msg\_rcv(entry)}.
The calling task will block until a message is available; a
call to {\tt msg\_ready(entry)} will return the number of messages
available on the entry or 0 if there are none.
We generally recommend that users program using a call-back style
in which task creation occurs on each message. However,
LISP users and users with special performance requirements may need
to consider {\tt msg\_rcv} because of its much lower overhead.
We say more about this interface in the manual section on ``non-standard
tasking requirements''.
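As a sketch of the recommended call-back style (the entry number, handler and {\tt lookup} routine are invented for this example and are not part of ISIS):
\begin{verbatim}
#include "isis.h"
#define LOOKUP  1

/* Task started for each incoming LOOKUP message */
int lookup_handler(mp)
    message *mp;
  {
    char name[64];

    msg_get(mp, "%s", name);
    reply(mp, "%d", lookup(name));      /* lookup() is application code */
  }

/* During initialization: */
    isis_entry(LOOKUP, lookup_handler, "lookup_handler");
\end{verbatim}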
Instead of having to call {\tt msg\_newmsg} to create a message, {\tt
msg\_put} to insert data into it, and {\tt msg\_delete} to delete it when you
are done, the broadcast interface allows you to specify the data you wish
to put into a message (as in {\tt msg\_put}).
The system creates a new message for the data and deletes the message
when it is no longer required.
On the receiving side, {\tt msg\_get} must be used to read data out of the
delivered message.
The sending process can specify the number of reply messages it wants to wait
for.
The constants {\tt ALL}\index{\tt ALL} or {\tt MAJORITY}\index{\tt MAJORITY}
can also be specified, denoting
that it wants to wait for replies from all or a majority of the currently operational
processes to which a copy of the message was sent.
The interface allows you to specify where the data in a reply message is to
be stored, so you do not have to be concerned with collecting and deleting
reply messages.
This mechanism is flexible enough to be used as an
asynchronous message
passing discipline or as a generalized form of remote procedure call.
{\em Note: The replies can only come from processes to which a copy of the
broadcast was actually delivered.
Thus, if a process joins a process group, it should not attempt to
reply to a broadcast that had been sent to the group prior to its
joining. This is true even if a copy of the message is passed as part of a
state transfer. Were a reply sent by a non-destination of a broadcast, it would
be silently discarded by ISIS. }
\index{{\tt bcast}, short form}
\begin{func}{Broadcast a message and collect replies}
\begin{verbatim}
int
bcast (addr_p, entry, fmt1, arg1, arg2, ..., nwanted,
fmt2, rep1, rep2, ...)
address *addr_p;
u_char entry;
char *fmt1, *fmt2;
\end{verbatim}
\end{func}
\begin{defs}
\defn{addr\_p}
Pointer to the address of the group (or process) to which to deliver the message.
\defn{entry}
Entry number in recipient process(es).
\defn{fmt1}
Format string for data to be put into outgoing message (see
Section~\ref{fmt}).
\defn{nwanted}
Number of reply messages wanted, or the constants {\tt ALL} or {\tt MAJORITY}.
\defn{fmt2}
Format string for data to be read out of incoming reply messages (see
Section~\ref{fmt}).
This and the arguments that follow it may be omitted if ${\tt nwanted} = 0$.
\end{defs}
\index{reply to a message}\index{{\tt reply}, short form}
\begin{func}{Send a reply to an incoming message}
\begin{verbatim}
reply (in_msgp, fmt1, arg1, arg2, ...)
message *in_msgp;
char *fmt1;
\end{verbatim}
\end{func}
\begin{defs}
\defn{in\_msgp}
Pointer to incoming message.
\defn{fmt1}
Format string describing how data is to be put into reply message.
\end{defs}
{\tt fmt1} specifies the type of data to be put into the outgoing message,
and the arguments following {\tt fmt1} are treated exactly as in {\tt
msg\_put}.
{\tt fmt2} specifies how data is to be read out of reply messages
(all replies to the same message are read using the same format string).
The arguments following {\tt fmt2} differ from those in {\tt msg\_get} in
one aspect.
In {\tt msg\_get} the arguments are pointers to variables of the
type specified in the format string.
In {\tt bcast}, the arguments are pointers to {\tt arrays of} such
variables, the size of each array being at least {\tt nwanted}.
The data from each reply message goes into an element of these arrays, the
order in each array being the order in which reply messages are collected.
Normally, a call to {\tt bcast} returns when {\tt nwanted} replies
have been received, with any subsequent replies being ignored.
The return value is the number of replies collected.
This function also sets the value of the global variable {\tt isis\_nsent}
to the number of processes that were sent a copy of the outgoing message,
and the value of the global
variable {\tt isis\_nreplies} to the number of replies collected.
When a specific value is given for {\tt nwanted}, it is often an
error if the number of replies received is smaller than this number.
We had originally planned to have ISIS return an error code in this case.
However, a problem then arises because the caller will still need to deallocate
any memory allocated as a result of the reply format items {\tt \%+x}
or {\tt \%m} having been used.
In the current implementation, ISIS returns the number of replies received, and
the user is expected to decide if a broadcast
has failed because too few replies were received.
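For example, a caller that wanted three replies might check the count itself (the entry number and variables are invented here):
\begin{verbatim}
    int nresp, acks[3];

    nresp = bcast(&gaddr, UPDATE, "%d", value, 3, "%d", acks);
    if (nresp < 3)
        ... treat the broadcast as having failed ...
\end{verbatim}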
\subsection*{Null replies}
A call to {\tt bcast} returns when {\tt nwanted} replies have been
received or when all the processes that were sent a copy of the message
have either replied or terminated.
It is possible to code an application in which it is known that, for
example, only 3 out of 6 members of
a process group will reply to a given message, and so
the sender sets {\tt nwanted} to 3.
If one of the 3 members supposed to reply terminates before replying
(possibly due to a failure), there is no way for the system to know that
the remaining operational members will never reply to this message.
The call to {\tt bcast} will never return unless the other 3 recipients
happen to terminate as well.
If this type of situation could arise in your application, recipients of a
message that do not intend to reply should inform the system of this
by sending a ``null reply'' to the message.
In general, a call to {\tt bcast} will return when {\tt nwanted} replies
have been received or when all the recipients of a message
have replied, terminated or sent null replies.
\begin{func}{Send a null reply to a message}
\begin{verbatim}
nullreply (in_msgp)
message *in_msgp;
\end{verbatim}
\end{func}
\index{\tt nullreply}
\begin{defs}
\defn{in\_msgp}
Pointer to incoming message.
\end{defs}
{\em Note:} A {\tt nullreply(msg\_p)} is not the same as {\tt reply(msg\_p,
"")}.
The former is never visible to the caller, whereas the latter
results in an empty reply message that really will reach the caller.
In particular, a null reply can never trigger the error code {\tt
IE\_MISMATCH}, whereas the latter could do so (this error code arises when
the contents of a message do not match the format items specified
in a {\tt msg\_get} format string).
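As an illustration (the predicate and variables are invented for the example), a recipient that may have nothing to report could be coded as follows:
\begin{verbatim}
int query_handler(mp)
    message *mp;
  {
    if (have_answer)
        reply(mp, "%d", the_answer);
    else
        nullreply(mp);      /* Tell ISIS not to wait for a reply from us */
  }
\end{verbatim}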
\index{\tt abortreply}
\begin{func}{Abort a broadcast}
\begin{verbatim}
abortreply (in_msgp)
message *in_msgp;
\end{verbatim}
\end{func}
\begin{defs}
\defn{in\_msgp}
Pointer to incoming message.
\end{defs}
{\tt abortreply} causes the caller of the broadcast to stop waiting for replies.
Error code {\tt IE\_ABORT} is returned, and any replies collected
by the system prior to receiving the abort are discarded.
\index{\tt forward}
\begin{func}{Forward a message}
\begin{verbatim}
forward (in_msgp, to, entry, out_msgp)
message *in_msgp, *out_msgp;
address *to;
\end{verbatim}
\end{func}
\begin{defs}
\defn{in\_msgp}
Pointer to incoming message.
\defn{to}
Pointer to a variable giving the address of the destination.
\defn{entry}
Entry number to use in destination.
\defn{out\_msgp}
Pointer to outgoing message.
\end{defs}
{\tt forward} is used by a process that is acting as an intermediary
\index{messages, forward to different destination}
between the caller of a broadcast and the process that will actually reply.
The system will deliver a copy of {\tt out\_msg} to the designated address
arranging that any reply sent to {\tt out\_msg} is delivered to the sender of
{\tt in\_msg}.
Although it is expected that {\tt out\_msg} will generally be the same as
{\tt in\_msg}, this is not required, and in fact the two messages can be
completely different (for example, this can provide a way for a trusted intermediary to
avoid forwarding access credentials).
Routines are available for determining whether or not a message has been
forwarded ({\tt msg\_isforwarded (msgp)})\index{\tt msg\_isforwarded}, for
\index{messages, determine if forwarded}
determining the ``original sender'', to whom replies will be delivered
({\tt msg\_getsender (msgp)})\index{\tt msg\_getsender}\index{messages, determine original sender}, and for
determining the ``most recent'' sender of a message ({\tt msg\_gettruesender (msgp)})\index{\tt msg\_gettruesender}.
\index{messages, determine most recent sender}
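A minimal sketch of such an intermediary (the server address and entry number are invented for the example) might simply pass the incoming message along unchanged:
\begin{verbatim}
int front_end(mp)
    message *mp;
  {
    /* The real server will reply directly to the original caller */
    forward(mp, &server_addr, SERVE, mp);
  }
\end{verbatim}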
\subsection*{Programming considerations}
The {\tt bcast} interface may seem unusable if the number and type of data
items to be put into a message or a reply may vary at runtime, or if
replies from different recipients carry different types of data.
(You may wish to put calls to {\tt msg\_put} in a loop or in an {\tt
if}-statement, for example.)
Another interface that gives you explicit control of the message to be
sent is described in Chapter 5.
It also gives you greater flexibility in specifying destinations for a
broadcast message (you can specify multiple process groups, for example).
Another way around this problem is to
create your own message using {\tt msg\_newmsg}, put in
the data as required, and copy this message as a whole into the broadcast
(or reply) message using the {\tt \%m} format item.
On reception, this message can be extracted using {\tt \%m} and the data
can then be read out of the extracted message using {\tt msg\_get}.
Remember to call {\tt msg\_delete} on any message you
create or read out of another.
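For instance, one might build the message in a loop and wrap it with the {\tt \%m} format item (a sketch; the entry number and data are illustrative):
\begin{verbatim}
    message *mp;
    int i;

    mp = msg_newmsg();
    for (i = 0; i < n; i++)
        msg_put(mp, "%d", value[i]);
    bcast(&gaddr, STORE, "%m", mp, 0);
    msg_delete(mp);          /* We created it, so we delete it */
\end{verbatim}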
% --- bcinterface.tex ---
\section{The ISIS broadcast and reply primitives}
Previous sections included calls to the ISIS broadcast
and reply facilities.
This section briefly sets out the alternatives provided.
\subsection*{Broadcast, short form}
\index{{\tt bcast}, short form}
The short broadcast invocations look like this:
\begin{verbatim}
nresp = [x]bcast(addr_p, ent, fmt, args..., nwant, fmt, ...);
\end{verbatim}
Here, the optional character {\tt x}, which can be one of {\tt a, c, f}
or {\tt g}, specifies the type of broadcast to do; it will be an
{\tt abcast} if left unspecified.
The choice of broadcast primitives available is discussed in Chapter 15.
The reader may want to familiarize him or herself with these choices, because
{\tt abcast} is a fairly expensive protocol, much more so than {\tt cbcast}.
The address and entry specify the destination of the broadcast.
The format string and arguments are used to construct the message that
will be sent, after which {\tt nwant} answers will be collected.
The value of {\tt nwant} may be any non-negative integer,
or one of the special values {\tt ALL} and {\tt MAJORITY},
designating that a reply is desired from all of the processes in the
eventual destination list, or from
one more than half that number, respectively.
The answers are collected into response vectors according to the
second format string.
After the request completes, {\tt nresp} will either be -1, in which
case {\tt isis\_errno} will indicate the error that occurred
(see {\tt isis\_perror(str)}), or a non-negative number
designating the number of replies actually received, which
will also be stored into the global variable {\tt isis\_nreplies}.
The global variable {\tt isis\_nsent} gives the length of the
destination list that was actually used by the broadcast.
\index{{\tt bcast}, number of messages sent / replies received}
Possible error codes are: {\tt IE\_TOOLONG}, indicating that the
internal limit of {\tt MAX\_PROCS} destinations for a single broadcast
was exceeded, {\tt IE\_BADARG}, indicating that an error was discovered
in one of the format items, or {\tt IE\_UNKNOWN}, indicating that the
group address specified was unknown to the system.
Notice that although ISIS provides a way to ask for replies from just some of the
members of a group,
it does not provide a mechanism for {\em broadcasting} to just some of the members.
However, there is an easy way to get this sort of behavior.
Say that you do not want to broadcast to an entire group at a time,
for example because the group provides some system service and its local
representative should be able to answer on behalf of the entire group.
To send to just some members, it is recommended that you call {\tt pg\_getview}
to determine the current membership of the group.
By scanning the membership list, one can easily identify any member
at the same site as the caller ({\tt addrp->site == my\_site\_no}).
If there is none, a random member can be selected instead.
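The following sketch illustrates this; it assumes that {\tt pg\_getview} returns a pointer to a {\tt groupview} whose member list is held in fields named {\tt gv\_nmemb} and {\tt gv\_members} (check {\tt isis.h} for the exact declaration); the entry number and variables are invented:
\begin{verbatim}
    groupview *gv;
    address *dest = (address *) 0;
    int i;

    gv = pg_getview(gaddr_p);
    for (i = 0; i < gv->gv_nmemb; i++)
        if (gv->gv_members[i].site == my_site_no)
            dest = &gv->gv_members[i];
    if (dest == 0)
        dest = &gv->gv_members[0];   /* No local member: pick one */
    nresp = bcast(dest, QUERY, "%s", name, 1, "%d", &answer);
\end{verbatim}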
\subsection*{How multiple replies are stored in memory}
If you ask for multiple replies to a broadcast,
\index{{\tt bcast}, how replies are stored in memory}
expect to get {\em vectors} of answers back.
For example, say that a broadcast returns 2 replies and the
reply format is specified as {\tt "\%d"}.
The corresponding {\tt bcast} format should be {\tt "\%d"}, but the
corresponding argument should be the address of a {\em vector}
of integers 2 or more elements long.
The replies will be saved into this vector one by one, in whatever order
they arrived.
If your application needs to know who sent the reply, include this as part of the
reply. ISIS will not automatically provide this information.
When the reply itself is a vector, things get fancy.
\index{{\tt bcast}, reply includes a vector}
Say that the reply format is given as {\tt "\%C"} or {\tt "\%A"}.
In this case, you should give the address of a sufficiently large
vector of characters (resp. addresses) to hold the entire set of
replies, plus the address of a vector of integers into which the
length of each reply can be saved.
ISIS will copy the first reply it receives, say {\em n} elements long, into the
first {\em n} elements of the vector you supplied, and set the
first element of the length vector to {\em n} (unless the length vector
is specified as a null-pointer).
The next reply will go into successive elements of the reply vector, and the
length will be copied into the next element of the length vector, etc.
If you expect fixed-length replies, this is all quite natural: dimension the
reply area, e.g., {\tt char answers[MAXANSW][ALEN]}, and if each answer is
indeed {\tt ALEN} bytes long, the {\em i'th} answer will be {\tt answers[i]}.
The same scheme would not work for variable-length answers with {\em maximum}
length {\tt ALEN}; here, one would have no choice but to declare {\tt char answers[MAXANSW*ALEN]}
and then compute the offset to the {\em i'th} answer by summing the
lengths of answers {\em 0..(i-1)}.
In our experience, variable length answers are easier to work with if
ISIS is asked to do dynamic memory allocation, using a format item like {\tt "\%+C"}.
Here, one just specifies a vector of pointers to the answers, e.g. {\tt char *answers[MAXANSW]},
and ISIS fills in values for each reply received.
\subsection*{Examples of various reply formats}
To illustrate these ideas, consider the following code fragments.
\index{{\tt bcast}, examples of reply formats}
Say that the reply is sent using the statement:
\begin{verbatim}
reply(msg_p, "%A[1] %d", &my_address, my_answer);
\end{verbatim}
A typical broadcast matching this reply might be the following:
\begin{verbatim}
address his_address[2];
int his_answer[2];
bcast(service, LOOKUP, "%s", name, 2,
"%a %d", his_address, his_answer);
\end{verbatim}
Notice that even if only one answer is expected, a vector must be provided
to the {\tt bcast}.
Notice also that the sender of the reply was forced to use a pointer to
the address being transmitted (the sending format {\tt \%a}
is not supported by ISIS because it requires that a structure be
passed by value, which is a non-portable operation in C).
The recipient of the reply is permitted to use the {\tt \%a} format
because in this case the place to store the replies is
supplied as a pointer to an array, and pointers {\em can} be
passed to {\tt msg\_get} in a portable way.
We could also have used the message formats {\tt \%-a}, {\tt \%+a},
or {\tt \%A[1]}, but in this case the variable {\tt his\_address}
would need to be of type {\tt address *his\_address[2]} and the
resulting pointers would remain valid only so long as the message
was not deallocated (except in the case of {\tt \%+a}, where the pointers
would point to a malloc-ed object).
Now assume that the reply includes an array of integers:
\begin{verbatim}
reply(msg_p, "%A[1] %D", &my_address, my_answers, ANSW_LEN);
\end{verbatim}
The caller would be expected to provide a vector of arrays:
\begin{verbatim}
address his_address[2];
int his_answer[2][ANSW_LEN], his_len[2];
bcast(service, LOOKUP, "%s", name, 2,
"%A[1] %D", &his_address, his_answer, his_len);
\end{verbatim}
Here, {\tt his\_len} gets filled with the constant {\tt ANSW\_LEN}.
Finally, for the variable length case, the caller would code:
\begin{verbatim}
address his_address[2];
int *his_answer[2], his_len[2];
bcast(service, LOOKUP, "%s", name, 2,
"%a %+D", his_address, his_answer, his_len);
\end{verbatim}
{\tt bcast} will fill {\tt his\_answer} with pointers to vectors copied
out of the message, and will fill {\tt his\_len} with the corresponding
lengths, in elements.
\subsection*{What happens when the formats don't match?}
\index{{\tt bcast}, reply format mismatch}\index{\tt IE\_MISMATCH}
If the format of a reply does not match with the format specified
in the broadcast (or in {\tt msg\_get}, for that matter), the scan
will cease immediately and error code {\tt IE\_MISMATCH} is returned.
Since this presumably indicates a software bug, no attempt is made
to ensure that the user can delete any dynamically allocated
memory or messages that were scanned prior to the error.
\subsection*{When do broadcasts block?}
\index{{\tt bcast}, when will it block}
You should assume that any call to {\tt bcast}
could block for a long time, allowing other tasks to run.
Specifically,
if you do a broadcast that waits for replies, the calling task will block
until the replies are received, and other messages can arrive during
this period.
Thus, you will want to be wary of possible reentrancy bugs when issuing a
group-RPC that expects replies.
Keep in mind that other tasks can run while your RPC is blocked,
and that things like the site-view and the membership of process-groups can
change, perhaps triggering watch and monitor routines.
Also, if you call {\tt pg\_getview} or {\tt pg\_lookup} prior to doing a broadcast to
a group to which the caller does not actually belong, be aware that
these calls will also block.
\index{\tt pg\_getview}\index{\tt pg\_lookup}
Broadcasts for which no replies are requested will usually return
without blocking. However, there are some rare situations in which
such a broadcast actually does block too.
This is also true when the {\tt f} (``fork'') option, described below, is employed
for a broadcast that does need replies.
\index{{\tt bcast}, fork option}
\subsection*{Broadcast, long form}
In the {\em long form}, a broadcast has the following format:
\index{{\tt bcast}, long form}\index{\tt bcast\_l}
\begin{verbatim}
nresp = [x]bcast_l("options", address, out-info ..., nwant, reply-info ...);
\end{verbatim}
The arguments are as follows:
\begin{description}
\item[options] A string composed of characters from {\tt lmsrxfzT\#Pname}.
Option {\tt l} indicates that the address is specified as a list.
\index{{\tt bcast}, address specified as a list}
Options {\tt m}, {\tt s} and {\tt r}
determine how the broadcast routine interprets the out-info and in-info information.
Option {\tt T\#} specifies a timeout after which the broadcast will be
interrupted.
Options {\tt x}, {\tt f}, {\tt z} and {\tt P}
indicate how the broadcast should be sent, as described below.
\item[address] If option {\tt l} is not given, this will be a process or
group address and an entry number, as for a short-form broadcast.
Otherwise, it points to a null-terminated list of addresses.
The address(es) are expanded into the destination list to which the
message will be sent.
\item[out-info] This is either a format string and its arguments, as for
the short form, or a pointer to the message to send if option {\tt m} or {\tt s} is
specified.
\item[nwant] This gives the number of replies to wait for.
\index{{\tt bcast}, number of replies desired}
It should either be a non-negative integer or one of the constants
{\tt ALL} or {\tt MAJORITY}.\index{\tt ALL}\index{\tt MAJORITY}
\item[reply-info] This is either a format string and its arguments, as for
the short form, or if option {\tt m} or {\tt r} is given,
a pointer to a vector of message pointers at
least {\tt nwant} long, in which pointers to the reply messages will be saved.
The number of replies will never exceed {\tt MAX\_PROCS}, currently 128, which is the internal
limit on the number of destinations to which a single broadcast can be sent.
\index{\tt MAX\_PROCS}
\end{description}
The meanings of the remaining options are as follows:
\begin{description}
\item [T\#] Interrupt the broadcast after \# seconds and return 0 to the
caller.
\item [f] Fork off the broadcast as a task; the caller will rendezvous with
it later to collect the results.
This option may not be combined with option T.
The use of this option is discussed in Section 7.2.
\item [x] Specifies that the sender of the message should not be sent a copy, even
if listed as a destination (``exclude the sender'').
This is useful when a sender takes a local action and then sends
an asynchronous broadcast to inform other processes of the outcome of
that action.
\item [z] Specifies that the message should be sent lazily.
ISIS may delay transmission to optimize its use of I/O
channels or to piggyback the message on other traffic.
Transmission will always occur within about 2 seconds.
\item [P] is used in connection with the BYPASS
mechanism in ISIS V2.0.
This option allows the user to convince ISIS to employ a
non-standard CBCAST transport protocol, and is discussed in Chapter 13.
There is a standard way to declare such a protocol to ISIS;
by specifying option {\tt Pname} one can force ISIS
to use declared protocol {\em name}.
\end{description}
Users should be aware of one additional error code, {\tt IE\_RESTRICTED},
which can arise when using broadcast option {\tt l}.
\index{\tt IE\_RESTRICTED}
This error means that the destination list contained more than two
destinations, one of which was a group
of which the sender is
neither a member nor an officially sanctioned client.
A process that is not a member or a client of
a group can broadcast to a group, but not to a list
including other destinations in the same atomic message delivery.
This restriction arises from the way address expansions are done in ISIS.
To avoid the problem, the caller should be added to the group as a client,
using {\tt pg\_client}.
\subsection*{Reply, short form}
\index{{\tt reply}, short form}
The short form of a reply looks like:
\begin{verbatim}
reply(mp, fmt, args);
\end{verbatim}
A reply message is generated according to the format and sent to the
process that transmitted {\tt mp}.
The reply will be discarded silently if not desired by that process.
\subsection*{Forwarding a message to someone else who will reply}
\index{\tt forward}
The {\tt forward} routine will forward a message to a process that
can then reply on behalf of the original recipient.
The interface is:
\begin{verbatim}
forward(fmp, to, entry, cmp)
message *fmp, *cmp;
address *to;
\end{verbatim}
Here, {\tt fmp} is the message the recipient received, {\tt to} and {\tt entry}
specify the destination, and {\tt cmp} is the message to actually send to
the destination.
Although {\tt cmp} would normally be the same as {\tt fmp},
ISIS does not require this---any message at all can be used.
A reply to {\tt cmp} will be delivered to the process that sent {\tt fmp}.
Moreover, should the destination that received {\tt cmp} apply {\tt msg\_getsender}
to it, the result will be the address of the sender of {\tt fmp}.
However, {\tt msg\_isforwarded(cmp)} will return {\tt TRUE}, and
{\tt msg\_gettruesender(cmp)} will return the address of the process that
invoked {\tt forward}.
\subsection*{Null reply}
A process sends a {\em null reply} to indicate that it will not
be sending a normal reply.
This is used to prevent the caller
from waiting for a normal reply in cases where the caller
requested replies from {\tt ALL} group members.
The interface is:
\index{\tt nullreply}
\begin{verbatim}
nullreply(mp);
\end{verbatim}
\subsection*{Abort reply}
\index{\tt abortreply}
The {\tt abortreply(mp)} routine causes a broadcast to abort.
The caller will receive error code {\tt IE\_ABORT} and any
replies that had been collected by the system will be discarded.
\subsection*{Reply, long form}
\index{{\tt reply}, long form}
The long form of a reply is quite similar to the long form of a broadcast,
but with a restricted set of options:
\begin{verbatim}
reply("options", mp, [cc-dests], );
\end{verbatim}
Here, the options supported are:
\begin{description}
\item[c] A list of additional destinations is supplied. These will
receive carbon-copies of the message.
If {\tt c} is not given, this address list is omitted.
{\em Note: replies are sent using {\tt cbcast}, hence
the {\tt cbcast} delivery guarantees also apply to replies.}
\item[x] The sender should be excluded as a destination when the
address list is expanded.
\item[f] The reply is sent using a FIFO broadcast primitive ({\tt fbcast})
rather than {\tt cbcast}. This option is recommended for use by
skilled ISIS programmers who develop servers
that interact with large numbers of unrelated clients.
Option {\tt f} yields a substantial performance improvement in such cases, and
in such a star-shaped communication configuration
{\tt cbcast} delivery properties are almost never needed.
\item[m] If specified, the out-info is a pointer to the message to send; otherwise,
the out-info gives a format and arguments for constructing the message.
\end{description}
\section{Forking off a remote procedure call as a task}
When the {\tt f} option is specified in a broadcast, a task is
created to run the broadcast asynchronously from the caller and
collect the replies.
The caller should rendezvous with the task prior to using the replies in his code.
The basic approach is as follows.
First, the broadcast is done, and the result it returns is noted:
\begin{verbatim}
bcid = bcast_l("f", ....);
\end{verbatim}
The number of replies requested should always be non-zero; otherwise, it is
not necessary to make use of the {\tt f} option.
\index{{\tt bcast}, fork option}
\index{broadcasts, event identifier (bcid)}\index{\tt bcid}
The {\em broadcast identifier} {\tt bcid} can then be used in the following
ways.
\begin{description}
\item[\tt bc\_wait(bcid)] Blocks until the broadcast terminates,
\index{\tt bc\_wait}
then returns what the broadcast would have returned---either
an error code, or the number of replies that
were collected.
\item[\tt event\_id *bc\_getevent(bcid)] Converts the broadcast-id into an event id,
\index{\tt bc\_getevent}
and returns a pointer to the event\_id structure.
The pointer remains valid indefinitely (we plan to eventually provide
a way to garbage collect these, but none has been implemented yet).
This indirect way of obtaining event-id's is used because {\tt bcast}
is supposed to return an integer, whereas an event-id is a data structure
consisting of the sender's address and the message-id that was used for the broadcast.
\item[\tt bc\_poll(bcid)] True (1) if the broadcast is finished, false (0) if not.
\index{\tt bc\_poll}
\item[\tt bc\_cancel(bcid)] This operation succeeds, returning
1, if the broadcast had not yet
\index{\tt bc\_cancel}
been delivered, in which case it has been canceled at all
its destinations. It fails, returning 0, if delivery already took place.
Relevant only if option {\tt f} was specified when the broadcast was done.
\end{description}
The {\tt event\_id} structure is declared as follows:
\index{{\tt event\_id} structure}
\begin{verbatim}
typedef struct event_id
{
int e_msgid; /* Message id number, unique to sender */
int e_op; /* Operation to invoke */
address e_pname; /* Address of sender */
} event_id;
\end{verbatim}
Event-id's are either obtained using {\tt bc\_getevent(bcid)} in the sender
of an asynchronous broadcast that was started using the {\tt `f'} option,
or by referring to the global variable {\tt my\_eid} inside a
request handler.
To illustrate the use of this option, consider the following problem.
A process wishes to perform some computation on successive
blocks of data, which it obtains by sending a request to some server.
The computation is slow, and it is desirable to overlap the computation of one
block with the request to fetch the next block.
A piece of code to do this is as follows:
\begin{verbatim}
/*
 * A procedure that overlaps fetching data with processing it
 */
async_computation()
  {
    /* Start the first request before looping */
    bcid = bcast_l("f", .... first request ....);
    do
      {
        old_bcid = bcid;
        if (not_yet_done)
            /* Start the next request */
            bcid = bcast_l("f", .... next request ....);
        /* Now wait for the prior one to complete */
        bc_wait(old_bcid);
        .... compute using results of the prior request ....
      }
    while (not_yet_done);
  }
\end{verbatim}
A typical strategy would be to double-buffer, reading the results of the
next request into {\tt buffer[n]} while computing using
the contents of {\tt buffer[n - 1]}.
Notice that the destination of the broadcast takes no special provision to
handle the fact that the caller is using the {\tt bcast `f'} option.
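A more concrete rendering of this strategy (the server address, entry number, {\tt BLOCKSIZE}, formats, and the {\tt more\_blocks} termination test are all invented for the example) might look like:
\begin{verbatim}
    char buffer[2][BLOCKSIZE];
    int len[2], n = 0, blockno = 0, bcid, old_bcid;

    bcid = bcast_l("f", &server, FETCH, "%d", blockno++, 1,
                   "%C", buffer[n], &len[n]);
    do
      {
        old_bcid = bcid;
        if (more_blocks)
            bcid = bcast_l("f", &server, FETCH, "%d", blockno++, 1,
                           "%C", buffer[1-n], &len[1-n]);
        bc_wait(old_bcid);
        compute(buffer[n], len[n]);
        n = 1 - n;
      }
    while (more_blocks);
\end{verbatim}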
\subsection {New BYPASS feature}
ISIS V2.0 supports a new implementation of the broadcast primitives
\index{bypass communication}
\index{multicast transport layer}
\index{{\tt cbcast}, bypass version}
\index{{\tt isis\_transport} }
called {\tt fbcast}, {\tt cbcast} and {\tt abcast} that
is {\em much faster} when multicasting to process groups under
certain restrictions.
Under some situations, literally
hundreds of multicasts per second can be transmitted using this
new mechanism on SUN 3/60 class machines, and it also permits ISIS to
take advantage of any nifty (and unusually fast) way you might have of
sending messages to a list of destinations.
It also supports a process-list facility for broadcasting to subsets
of the members of a group without paying the cost of a {\tt pg\_subgroup}
operation. The mechanism is described in Chapter 13.
% --- bypass.tex ---
One of the major costs in ISIS is associated with sending multicasts
\index{bypass communication}
\index{{\tt cbcast}, bypass version}
indirectly via the program called {\tt protos}.
This chapter describes a feature of ISIS V2.0 called the {\em bypass}
communication suite. Using it, messages are sent directly
to their destinations without indirecting through protos.
Large speedups result.
Moreover, the facility includes a way to communicate with subsets of groups
and to define new high speed transport protocols.
\section{When will bypass communication be used?}
ISIS V2.0 automatically uses bypass communication for all multicasts that
satisfy the following addressing properties.
\begin{enumerate}
\item
The communication must
be to a single destination---not a list of addresses.
\item
This destination must be a group to which the caller belongs, or a
group of which it is a client (see {\tt pg\_client}), or a
\index{\tt pg\_client}
process list (see below) defined by the caller, or a single process
that belongs to a group to which the caller belongs.
\item
The broadcast protocol must be {\tt mbcast},
{\tt fbcast}, {\tt cbcast} or {\tt abcast}.
\end{enumerate}
A multicast that does not satisfy these rules will be sent via the
older protocols. This is transparent to the caller but slow.
However, transparency can be a source of bugs: one can imagine a
change to a program that would invalidate one of the above rules
and cause things to slow down drastically. To guard against this,
broadcast option {\tt B} can be specified. This option indicates that
the broadcast {\em must} be done by bypass mode.
Specifying the option won't change how ISIS
makes its decision, but it does cause ISIS to complain if a
multicast cannot be sent in bypass mode.
Bypass mode supports the broadcast {\tt x} option (excludes the sender),
and it also supports the new option {\tt Pname}, which forces the system to
employ a user-defined multicast transport protocol. These protocols
are described below.
Replies to a bypass multicast are always sent in bypass mode.
\section{How does bypass communication work?}
The bypass protocols work by adding a small amount of extra information
to your message and then sending it, by the fastest direct route available,
to the list of destinations obtained by expanding the destination
address. On arrival, messages are delayed if necessary to satisfy the
ordering requirements of the protocol selected. A paper on this
protocol is available from the Cornell ISIS group.
The effect is to make {\tt cbcast} extremely cheap (about .6ms
on the sending side, plus the cost of doing the IO operation, and about
.9ms on the receiving side, plus the cost of doing the read, on a SUN
3/60 system). This 1.5ms overhead will generally be far smaller than
the cost of doing the IO itself. A bypass RPC to a single destination
takes about 10.6ms to return, which is about 2ms more than for SUN
RPC on the same machines over the same network.
When using {\tt fbcast} about half of the sending and receiving cost is
eliminated and such an RPC costs about 9.8ms.
The bypass header is about 180-200 bytes per message.
The primitive called {\tt mbcast} is essentially a raw interface
to the transport layer. It imposes almost no overhead at all, but
it also provides no virtually synchronous ordering or addressing
properties. This interface is beneficial when
an application needs some special property of the multicast transport
layer (such as a realtime delivery guarantee) and cannot risk delays
due to group membership synchronization that could interact with
those low-level properties.
This protocol should never be used by non-experts and messages
transmitted using this protocol should never be passed to
ISIS tools such as the coordinator-cohort algorithm, which requires
virtual synchrony. The recipients of an {\tt mbcast} may
see messages in different orders, or non-atomically, and may
see different group membership contents when a given message
is received.
Bypass messages are physically moved using a {\em multicast transport}
protocol. The one we provide is based on UDP and hence sends data
in a point-to-point manner. It slows down linearly with the number of
destinations, with a sending side cost of about 2ms on the SUN 3/60
and a receiving side cost of about 2.2ms, counting memory allocation.
In many cases there are more efficient protocols, especially if
you know something about your ethernet (perhaps it supports
hardware multicast) or your application (perhaps it has a tree
structure set of 200 destinations on a token bus). In such
cases, you can specify an alternative transport protocol and force
ISIS to send bypass messages using this. The higher level interface
is unchanged when using this, except that you need to plug your
transport protocol into the system using the {\tt isis\_transport}
system call, and to invoke it using a broadcast with the {\tt Pname}
option specified.
\section{Sources of overhead}
Bypass communication does introduce one new source of overhead:
there is a synchronization delay when process groups change membership,
or when switching between the {\tt protos} protocols and back.
Also, there may be a delay if a process is sending to one group and
suddenly switches to a different group.
\section{Process lists}
\index{process lists}
\index{{\tt pl\_create}} \index{{\tt pl\_add}} \index{{\tt pl\_remove}}
\index{{\tt pl\_delete}} \index{{\tt pl\_rank}} \index{{\tt pg\_rank}}
\index{{\tt pl\_getview}} \index{{\tt pg\_getview}} \index{{\tt pl\_makereal}}
The {\em process list} interface provides a way to avoid overhead
in the case where an application multicasts to varying subgroups of a
large and fairly static group. The mechanism is only intended for use
with the bypass communication facility.
The routine
{\tt list\_p = pl\_create(gaddr\_p, alist)}
is used to create a process list consisting of the null-terminated
list of processes {\tt alist}, which should be a subset of the
members of {\tt gaddr\_p}.
The list will later be
referenced using a {\em list pointer} of type {\tt address}, but
known only within the address space of the process that invoked {\tt pl\_create}.
The membership of the list
is automatically changed if the membership of the parent group
changes, or can be
manually modified by calls to {\tt pl\_add(list\_p, addr\_p)}
and {\tt pl\_remove(list\_p, addr\_p)}, specifying the process
whose address should be, respectively, appended to the list or removed from it.
The entire list can be deleted by a call to {\tt pl\_delete}.
The only reason for creating a process list is to use it in bypass
multicast communication.
This permits the user to overcome one of the limitations on the
bypass communication protocols, namely that the destination be either
a single group to which the sender belongs or is a client, or a single process
that belongs to some group that the sender also belongs to.
When using a process list as the destination in a multicast, ISIS will
route the resulting communication via the so-called ``bypass'' protocol suite, just as if
the list address were a real process group address.
In contrast, if the same destinations were ``spelled out'' by passing a null-terminated
vector of their process
addresses directly to the multicast interface, the bypass protocols would not be used and
communication performance would suffer.
{\tt pl\_getview}
is an alias for
{\tt pg\_getview}.
When called with a {\tt list\_p} argument,
{\tt pg\_getview} will construct a {\em groupview}
data structure from a process list and return it.
Similarly, the routine
{\tt pl\_rank}
is an alias for
{\tt pg\_rank} and can be applied to a process list.
Despite these similarities, a
process list cannot be monitored
or used like a process-group address; instead, one should
monitor its parent group. Moreover, one cannot
pass a list address to a remote process (more specifically,
one can pass a list of processes and a group address, which
a remote process could use to create a new process list for its own
use; however, an existing process-list address is of no use to a
process other than its creator,
since the membership of the list is local to its creator).
In our view, these sorts of operations are not required and
support for them would impose exactly the sort of overhead that the
process list facility is intended to avoid.
If desired, a process list can be ``turned into'' a completely normal
process group using a call to
{\tt gaddr\_p = pl\_makegroup(list\_p, gname)}.
This creates a new process group
having the specified name and returns the new address.
Notice that when using a
process list the members do not know that
they belong to the list.
Consequently, the facility cannot be used to obtain fault-tolerance as in a
coordinator-cohort computation.
A further disadvantage is that process lists do not have names.
\section{Multicast transport protocols}
\index{multicast transport layer}
\index{{\tt isis\_transport} }
This section describes the interface used to extend the default
UDP-based multicast transport protocol with additional user-defined
protocols.
Such a delivery protocol has two major features: it may accept or reject
individual messages, and messages that it has accepted must be
delivered reliably to the groups or processes addressed.
It is somewhat weaker than a transport protocol, which must also order
messages; since ISIS is ordering messages at a higher level,
the delivery protocol need not repeat this work.
If a message is rejected by a user-supplied delivery protocol, it will
be transmitted using the default protocol, which accepts all
messages.
The basic interface is as follows:
\begin{verbatim}
include "isis.h"
/* Define a new transport protocol */
int pn = isis_transport(pname, mt_send, mt_groupview, mt_physdead);
char *pname;
int (*mt_send)(gaddr, exmode, mp, to, callback, arg0, arg1);
void (*mt_groupview)(ginfo_p), (*mt_physdead)(addr_p);
/* Interface to default transport protocol (pn 0) */
int net_send(gaddr, exmode, to, callback, arg0, arg1);
/* Test status of a process */
int isis_querydead(who);
/* Break a big message into pieces */
void isis_fragment(size, mt_send, gaddr, exmode, mp, to, callback, arg0, arg1);
/* Hand a message into ISIS */
void isis_receipt(mp, from, pn);
/* Determine if a message might trigger a point-to-point reply */
int MAY_REPLY(mp);
/* Yield the CPU briefly to let the delivery task run */
void t_yield();
\end{verbatim}
When defining a new transport protocol, you will need to provide ISIS with 3
routines:
\begin{enumerate}
\item {\bf mt\_send}: routine to send message to list of destinations.
\item {\bf mt\_groupview}: callback for new process-group views.
\item {\bf mt\_physdead}: hint about failures.
\end{enumerate}
Inform ISIS of the names of these routines by calling
{\bf isis\_transport(pname, mt\_send, mt\_groupview, mt\_physdead)},
where {\tt pname} is an ASCII name for the
transport protocol you wish to define. A {\em protocol
number} between 1 and
${\tt MAXTRANSPORT}-1$ will be assigned to this protocol and returned.
Transport protocol 0 is pre-defined to correspond to the ISIS UDP
protocol.
You may find it convenient to call this protocol
for sending point-to-point messages from within your transport
code.
We explain how to do this below.
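For instance, a protocol might be registered during application initialization as follows (the routine names, protocol name and variable are placeholders):
\begin{verbatim}
    int my_send();
    void my_groupview(), my_physdead();
    int my_pn;
        ...
    my_pn = isis_transport("mynet", my_send, my_groupview, my_physdead);
\end{verbatim}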
\subsection{Dealing with new group views and failures}
When communication with a group becomes a possibility, ISIS will call
{\tt mt\_groupview(ginfo\_p)}, giving the address of the
{\tt ginfo} structure about the group.
This routine will also be called every time the group membership
changes.
{\em It will be called in the same order at all processes that belong
to the group. }
Your protocol should expect to begin receiving messages from a
process anytime after this routine is called with a groupview
containing that process.
When a process fails, ISIS goes through a two-stage sequence.
First, the system may call {\tt mt\_physdead(addr\_p)}, giving the
address of that process.
It does this as a sort of a hint to your protocol because
there may be a delay before the group membership is changed,
e.g. because some messages are being flushed to ensure atomicity
of multicasts initiated by the process that crashed.
However, there are situations where this routine will not be called at all,
hence it should be treated as a hint and nothing more.
Calls to {\tt mt\_physdead} may come in {\em any} order at different
members.
However, ISIS will guarantee that if a member fails,
surviving members will either {\em eventually} see a call to
{\tt mt\_physdead} for this member, or a call to
{\tt mt\_groupview} with a view that does not contain that member.
And, calls to {\tt mt\_groupview} are done in the same order everywhere.
This is a property on which your transport protocol can depend.
Your protocol may sometimes detect apparent failures.
ISIS does not allow you to act on such failures directly, since
you could be wrong. However, it does provide a way for
your code to encourage ISIS to check the status of a
recalcitrant process.
This is done via the routine {\tt isis\_querydead(addr\_p)}.
{\tt isis\_querydead} doesn't keep track of process status that ISIS may have
reported directly to your code, so be careful to monitor
calls to {\tt mt\_physdead(addr\_p)} and {\tt mt\_groupview} if this is what
you need.
Needless to say, it is inadvisable to call {\tt isis\_querydead} repeatedly for the
same argument.
{\tt isis\_querydead(addr\_p)}
returns 1 if ISIS believes that this member is dead, and 0
otherwise.
It does this by probing
the indicated process to see if it is responsive; while doing so, the call
may block for a significant period of time.
Thus, if 0 is returned, the process {\em actually responded to a probe
message after the call was done.}
This action will cause ISIS to notice if the site where the destination is
running has crashed, but will {\em not} detect the fact that a process
has gone into an infinite loop.
Note: after calling {\tt isis\_querydead} and before it returns, a call
to {\tt mt\_physdead(addr\_p)} or {\tt mt\_groupview} might occur.
Your code should be designed to operate correctly in these cases, for example by
using some sort of flag.
One way to avoid this issue is to run {\tt isis\_querydead(addr\_p)} asynchronously, as follows:
{\tt t\_fork(isis\_querydead, addr\_p);}
This creates a new task to do the query call; the result is discarded.
The task that calls {\tt t\_fork} will not be blocked while the query occurs.
\subsection{Basic transport protocol interface}
The basic transport communication interface is through the send routine.
Calls to {\tt mt\_send} have the following interface:
\begin{verbatim}
mt_send(gaddr, exmode, mp, to, callback, arg0, arg1)
address *gaddr;
register message *mp;
register address *to;
int (*callback)();
char *arg0, *arg1;
{
....
}
\end{verbatim}
Basically, a call to {\tt mt\_send} requests that {\tt mp} be
transmitted to the destinations in {\tt to} and that the
{\tt callback} routine be invoked when the message is known to have
been delivered or the destination is dead.
The destinations in {\em to} are guaranteed to be a proper subset of the
members and clients of the group.
(Your protocol may or may not take advantage of this---ISIS will throw
away surplus messages).
If the callback routine is not specified as a null pointer, you
should do the callback separately for each destination, as
follows: {\tt (*callback)(addr\_p, arg0, arg1);}, where
{\tt addr\_p} is a pointer to the address of the destination in
question.
It is important that you not do the callback until the messages
have reached their remote destinations safely, as this is one of the
tools ISIS uses to decide how long to keep spare copies of a message
to ensure atomicity after failures.
Your protocol may reject a request to send a message, by returning -1; it
should return 0 if the message was accepted.
In the former case, ISIS will transmit the message in question using transport
protocol 0.
The protocol 0 transport routine is called {\tt net\_send};
it transmits messages using the UDP packet protocol.
You are free to call {\tt net\_send } if your protocol has a need
for reliable point-to-point messaging.
However, be aware that {\tt net\_send} uses acknowledgement packets to
ensure that delivery is reliable.
There is no unacknowledged version of the {\tt net\_send} protocol.
This means that {\tt net\_send} is not a particularly good way
to send acknowledgement packets needed by your own protocol, unless
you want to be absolutely sure they reach their destination.
(Using an acknowledged protocol to send the acks for protocol {\tt n}
could effectively double its acknowledgement traffic).
When a process is declared dead, whether from {\tt mt\_groupview}
or {\tt mt\_physdead}, it should be treated
like a sink.
All messages to that process (if any) should be
discarded and ISIS should be told that any pending messages
(and any future attempts to send) have terminated, by calling the
specified {\tt callback} procedure (if the pointer was non-null).
\subsection{Self-addressed messages and exclusion mode flag}
In general, {\tt to} may include the address of the {\em sender}.
That is, for some messages, there will be an address in this
null-terminated list for which {\tt addr\_ismine()} returns TRUE.
The {\tt exmode} flag is set to 1 if this copy of the message {\em
should be ignored}.
In this case, you should transmit the packet to all addresses except this
one.
On the other hand, if {\tt exmode} is 0, this ``local'' copy of the message
can be delivered whenever your protocol is ready to do so
(immediately if you like) by calling
{\tt isis\_receipt(mp, \&my\_address, pn).}
Here, {\tt pn} is the protocol transport number you picked for your
protocol.
Notice that if {\tt exmode} is set to 1 and {\tt to} only lists one process
for which {\tt addr\_ismine} returns true, your protocol will not
need to do any work, but must call the
callback routine if the pointer is non-null.
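Putting these rules together, a skeletal {\tt mt\_send} might look like the following sketch. Here {\tt MY\_MAXSIZE} and {\tt my\_transmit} are hypothetical names, {\tt my\_pn} is the protocol number returned by {\tt isis\_transport}, and {\tt my\_transmit} is assumed not to return until the message has safely reached its destination:
\begin{verbatim}
int my_send(gaddr, exmode, mp, to, callback, arg0, arg1)
    address *gaddr, *to;
    message *mp;
    int exmode, (*callback)();
    char *arg0, *arg1;
  {
    register address *ap;

    if (msg_getlen(mp) > MY_MAXSIZE)
        return(-1);                /* Reject: ISIS falls back to protocol 0 */
    for (ap = to; !addr_isnull(ap); ap++)
      {
        if (addr_ismine(ap))
          {
            if (exmode == 0)       /* Deliver the local copy, unless excluded */
                isis_receipt(mp, &my_address, my_pn);
          }
        else
            my_transmit(ap, mp);   /* Reliable point-to-point send (hypothetical) */
        if (callback)
            (*callback)(ap, arg0, arg1);
      }
    return(0);
  }
\end{verbatim}
(The test {\tt addr\_isnull} is meant to detect the null terminator of the destination list; check {\tt isis.h} for the exact macro name.)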
\subsection{Delivery of messages from remote processes}
In the case of a receipt of a message from some remote
destination, call {\tt isis\_receipt(mp, addr\_p, pn)},
specifying the address from which the message arrived and the
protocol transport number that received it.
{\tt isis\_receipt} will put messages in sequence and detect and reject
duplicates, so your protocol need not worry about doing this.
The application will get stuck, however, if your protocol accepts a message
but never gets around to delivering it at some destination to which
the message is addressed.
The client dump contains enough information to figure out that this has
happened, but you need to suspect the problem to know where to look.
\subsection{Other useful routines}
ISIS provides several routines that we find helpful in designing transport
protocols.
For example, your protocol may need to send a message to acknowledge receipt
of message {\tt mp}.
Should it send the acknowledgement immediately, or wait a little while
in the hope that some other message will be sent back to the
sender of {\tt mp}?
Obviously, ISIS can't predict the future, but it can tell if the sender
of {\tt mp} is waiting for one or more replies.
If so, your protocol might want to wait for the routine to which {\tt mp}
is delivered to run, in the hope that it will generate such a
reply (if a multicast is received using protocol {\tt pn}, any
reply to it will be presented first to the {\tt mt\_send} routine
for protocol {\tt pn}).
The predicate {\tt MAY\_REPLY(mp)} tells whether a reply to {\tt mp} is possible.
If this predicate is false, the message will not generate a reply.
If this predicate is true, the message might generate a reply fairly
soon.
This is a hint to the delivery protocol not to send
an acknowledgement immediately, as there is a chance that the
acknowledgement can be piggybacked on the reply message.
But, how long should your protocol wait?
After all, {\tt MAY\_REPLY(mp)} only gives a hint, and perhaps no reply
will be sent!
To overcome this problem, ISIS provides a routine {\tt t\_yield}.
If your protocol's receiving task does a {\tt t\_yield()}, ISIS will
deliver the message {\tt mp}.
If a reply gets sent immediately, your {\tt mt\_send} routine will be
called while the receiving task is still suspended.
If no reply is sent, or {\tt isis\_receipt} can't deliver {\tt mp}
promptly, {\tt t\_yield} will return and you can send the
acknowledgement as a separate packet.
Note that {\tt t\_yield} returns no indication of what happened; you
are expected to keep track of this yourself using global flags.
The {\tt MAY\_REPLY} predicate is uniformly true or false at all processes
that receive a given message.
One easy mistake is to forget that {\tt msg\_read} creates a message.
Don't forget to do a {\tt msg\_delete} after your code has finished
with such a message, or with one extracted from inside another
message using the {\tt \%m} format item.
\subsection{Fragmenting large messages}
Your protocol may have a size limit on messages.
To check the size of a message, call {\tt msg\_getlen(mp)}.
If a message is too long for your protocol, you may call
{\tt isis\_fragment(size, mt\_send, gaddr, exmode, mp, to, callback, arg0,
arg1) }.
This routine will repeatedly call the specified {\tt mt\_send} routine
with fragments of the message pointed to by {\tt mp} that are
no larger than {\tt size}.
The remaining arguments to {\tt isis\_fragment} ({\tt gaddr}, {\tt exmode},
{\tt to}, {\tt arg0}, and {\tt arg1}) will be passed to
the send routine unchanged.
However, the callback routine will be passed as a null pointer on all but
the last fragment of the message.
This way, the user-supplied callback will not be called until after all
fragments of the message have been delivered.
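For example, an {\tt mt\_send} routine might begin as sketched below.
The size limit {\tt MAX\_PACKET} and the routine {\tt transmit\_packet} are
assumptions, and the parameter list is simply the one implied by the
{\tt isis\_fragment} argument order above; check it against the signature
you actually registered.
\begin{verbatim}
#define MAX_PACKET 8192   /* assumption: largest message this transport carries */

void
my_mt_send(gaddr, exmode, mp, to, callback, arg0, arg1)
    address *gaddr, *to;
    int exmode;
    message *mp;
    int (*callback)();
    char *arg0, *arg1;
{
    if (msg_getlen(mp) > MAX_PACKET) {
        /* Let ISIS call this routine back once per fragment; the callback
           pointer will be null on all but the last fragment. */
        isis_fragment(MAX_PACKET, my_mt_send, gaddr, exmode, mp, to,
                      callback, arg0, arg1);
        return;
    }
    transmit_packet(mp, to);     /* hypothetical: send mp as one packet */
}
\end{verbatim}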
\subsection{When your protocol will be used}
ISIS will always use protocol 0 by default.
To convince ISIS to use your protocol, use {\tt mbcast\_l}, {\tt fbcast\_l},
{\tt cbcast\_l} or {\tt abcast\_l},
specifying the option {\tt ``Pname''} where {\tt name} is the protocol
transport name you used.
The mbcast protocol gives FIFO ordering (like FBCAST) but might not be
atomic in the event that the sender fails during transmission.
The other protocols are atomic and
give FIFO delivery order, CBCAST order, and ABCAST order
respectively, and each is more costly than the preceding one.
Also, the others are ordered with respect to GBCAST invocations, while
{\tt mbcast\_l} might not be.
{\tt mbcast\_l}
gives the very fastest possible multicast in ISIS, short of calling
your transport protocol directly.
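For instance, a process might issue a {\tt cbcast} over a transport registered
under the name {\tt mynet} roughly as follows.
The entry number {\tt UPDATE}, the group address pointer, and the exact
{\tt cbcast\_l} argument order shown (options, group, entry, outgoing format
and values, number of replies wanted, reply format) are stated here as
assumptions rather than a definitive calling sequence.
\begin{verbatim}
#define UPDATE 1          /* hypothetical entry number in the destination group */

void
send_update(gaddr_p, new_value)
    address *gaddr_p;
    int new_value;
{
    int ack, nreplies;

    /* "Pmynet" selects the transport registered under the name "mynet". */
    nreplies = cbcast_l("Pmynet", gaddr_p, UPDATE, "%d", new_value,
                        1, "%d", &ack);
}
\end{verbatim}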
Note that ISIS will not use your protocol if you don't obey
the various restrictions on destination address.
In such cases, the multicast will work using the old, slow ISIS mechanism.
We expect that support for external delivery protocols will gradually
improve in future releases of ISIS.
One facility we are considering will allow a protocol to learn something
about network topology at runtime, for example to determine whether
all of a set of processes are on the same Ethernet or token ring.
ISIS provides a few facilities for printing information about the\index{client dumps, interpreting}
state of a process while it is running, and for
generating a dump showing the state of various system data structures within
the process (and within the ISIS protocols kernel as well).
If you want to print information on your own, the most useful routines are
{\tt paddr(addr)}, which prints an address structure in the form {\tt (site/incarn:pid.entry)}
\index{client dumps, generating with \tt cl\_dump()}\index{\tt paddr}\index{\tt pmsg}\index{\tt print}
or for a group, {\tt (gaddr site/incarn:gid[entry])}.
Entry numbers are often listed as 0 or as {\tt rcv\_reply} when the value
has not actually been set to anything explicit.
To print a null-terminated list of addresses, such as one finds within the various view
data structures, call {\tt paddrs(alist)}.
Neither prints a newline.
To print a message, one normally calls {\tt pmsg(msg)}.
This routine prints something like:
\begin{verbatim}
MSG from (5/0:1652.0) to (5/0:1652.query) (5/0:1653.3) sid 110 size 268
\end{verbatim}
The output string is terminated by a newline.
It lists the sender address (the entry information is
normally meaningless), the destination(s) (the entry will print as
a text string if the mapping is known, and as an integer otherwise),
the session-id used to match this message with replies sent in response to it, if any,
and the size in bytes.
The session-id may be absent if nobody expects a reply to this message.
Much more information can be printed about a message by calling {\tt msg\_printaccess(msg)},
but unsophisticated users are unlikely to find this level of detail useful.
Finally, ISIS provides a version of {\tt printf} called {\tt print},
which is called the same way as {\tt printf}, but has
two special features: it flushes after every call, and its output can
be redirected into a log file.
All messages generated within ISIS come from print, except in the
rare situations where the system calls the UNIX {\tt perror} function after
a failed system call.
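For example, a message handler that wants to leave a trace of each request it
receives might do something along these lines (the handler name and the choice
of what to print are illustrative only):
\begin{verbatim}
void
trace_request(mp)
    message *mp;
{
    print("request received by ");
    paddr(&my_address);          /* paddr prints no newline */
    print(": ");
    pmsg(mp);                    /* pmsg ends its output with a newline */
}
\end{verbatim}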
To divert the output from {\tt print} into a log file, the user calls
{\tt isis\_logging(1)}.
Later, output can be
redirected to standard output by calling {\tt isis\_logging(0)}.
While logging is enabled,
dumps and all other ISIS-generated output will
go into the file {\tt $<$pid$>$.log}, where {\tt $<$pid$>$} is
the process-id number of the client process.
ISIS provides a
sophisticated state ``dump'' facility, which is linked into
client processes automatically.
A client dump can be obtained by calling the routine {\tt cl\_dump(DUMP\_ALL, msg)},
where {\tt DUMP\_ALL} indicates that a complete dump is desired, as described below,
and {\tt msg} is a message to print as the reason for the dump.
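For instance, an application that notices something suspicious might spool a
full dump to the log file rather than cluttering its terminal; the consistency
check itself is, of course, hypothetical:
\begin{verbatim}
if (table_is_corrupt()) {           /* hypothetical application-level check */
    isis_logging(1);                /* divert ISIS output to the .log file */
    cl_dump(DUMP_ALL, "table corrupted");
    isis_logging(0);                /* back to standard output */
}
\end{verbatim}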
A second way to trigger a dump is to send a {\tt HANGUP} signal to the
client process.
By default, ISIS traps these signals and prints a dump when they
occur.
The output of a client dump sent this way is currently spooled automatically, using the
{\tt isis\_logging} feature described above. This is helpful when
an ISIS client appears to be misbehaving but its console output is
going to an inconvenient place.
A message will also be printed on standard output indicating that a client
dump is being spooled.
Hangup signals are a relic of the past and the fact that ISIS does this
should not affect the correct execution of your application.
If this poses a problem for your application,
edit the ISIS source file {\tt isis.c} and change the signal number used.
To send such a signal, use the UNIX {\tt kill} command, for example {\tt kill -USR2 12345}
or {\tt kill -1 12345}.
\index{client dumps, generating with {\tt kill} command}
The dump level specifies the degree of detail desired in the dump.
Inside the protos part of ISIS, this level can be controlled fairly
carefully, but in the current client library only the maximum level, {\tt DUMP\_ALL},
can be given.
A facility for making
more selective dumps will be introduced in the future.
This is a typical client dump. It was made while the program called
``twenty questions''
was starting up, using a {\tt SIGUSR2} signal:
\begin{verbatim}
CLIENT PROCESS 16477 INTERNAL DUMP REQUESTED: received signal CLDUMP
isis internal state vector: [isup], isis_nblock 0
Memory mgt: 222 allocs, 19 frees, 85400 bytes in use
Message counts: 114 allocs 111 frees (3 in use), 0 enqueued by MSG_ENQUEUE
*** Ignored 3 replies, 3 were nullreplies (no bcast-task was waiting)
Tasks: 16 created, 38 light-weight context switches
TASK [46188, pc=0x1e488 sp=0xf7fff958]:
(0x0), wait (isis_mainloop waiting forever), activity=1
TASK [72a04]:
isis:blocking-listener(0x1), ** running **, activity=1
TASK [78af8, pc=0x1e488 sp=0x7e658]:
strader_maintask(0x8100b), sent to (gaddr=4/0.5[2])
Wants 1 of 3 replies, got 0+2 null, msgid=31
Dests: (4/0:16475.2)(4/0:16476.2)(4/0:16477.receive_order)
Status , startup task, activity=0
Activities:
Slot 0 = (system activity), 1 pending watch/monitor requests and messages
Slot 1 = (act=4/0:16477.1), 2 pending watch/monitor requests and messages
Message queues:
Monitoring / watching:
[act 0 swid 1]: monitoring site-views, Call pg_watch:w_new_sl(0)
[act 1 wid 80000001]: Watch process (4/0:16475.0). Call pg_watch:pg_detect_pfa
ilure(4c688)
[act 1 wid 40000001]: Watch group (gaddr=4/0.4[0]). On total failure call pg_l
ookup:refresh_gaddr_cache(48af8)
Site view 4/1:
fafnir.cs.cornell.edu [site_no 4 site_incarn 0]
Interclient dump: 34 GT calls
(4/0:16476.0) [fafnir.cs.cornell.edu]:
estab;alive; got: 14/14+0 dups, sent: 13+0 ret, 1 acks, backlog 0
(4/0:16475.0) [fafnir.cs.cornell.edu]:
estab;alive; got: 6/6+0 dups, sent: 8+0 ret, 1 acks, backlog 0
Message tank: 0 messages, 0 bytes
Process group views:
Group client_group, (gaddr=4/0.5[0])
View 3.0, 3 members = (4/0:16475.0)(4/0:16476.0)(4/0:16477.0), seqn: mine 5/
vid 3.0 [ 0 5 5 ]
... bypass viewid 3.0 nmemb 3 msgcnt 0 bsndcnt 0 brcvcnt 0
... next bypass viewid 3.0 nmemb 3, flushes counted 3 want -1
... current protos viewid 3.0 nmemb 3
Cached group address position_trader, (gaddr=4/0.4[0])
Bypass procs: 0 active groups
Group (gaddr=4/0.5[0])
Sender (4/0:16476.0)... garbage collected through seqn 5.
Abcast ordering queue:
Dump of bc_nodes:
\end{verbatim}
(This dump was slightly edited to fit nicely on the page, but
we didn't change it in any important ways).
From this dump we see that the process-id of the program was
16477. The memory management and message statistics are fairly typical.
Large amounts of memory, say hundreds of k-bytes, or large numbers of
active messages would indicate a likely problem with the application
software, for example a failure to delete messages that had
been created or extracted from other messages.
ISIS itself rarely needs very much memory or very large numbers of
messages.
Next, the dump describes the set of active tasks for this process.
In this case there are three such tasks.
For each of these tasks, ISIS prints out the stack frame in which it
blocked -- except for the second task, which was actively running
when the dump was made. It also prints the ``name'' of the task (a
string that comes from {\tt isis\_task}). In this example,
the three tasks include the main thread, which called {\tt isis\_mainloop}
and is looping forever, a system-generated task called the scheduler,
and a user-created task called {\tt strader\_maintask}. We are told that
this task was the startup-task (i.e. it was spawned in the call to
{\tt isis\_mainloop}), that it has initial argument 0x8100b,
that it is doing a multicast to group
{\tt client\_group} (you can look up the group address in the section of
the dump that discusses process group views), and that the
multicast expanded to three destinations. It wants a single reply
but so far has only received null replies, from two destinations
(the second two in the list: 16476 and 16477).
The task is still waiting for a
reply from the remaining destination (16475). The status vector
({\tt WNN}) encodes the state of each pending reply as W for waiting,
R for received, N for null, and D for died.
The {\em activity id} is only used when a process has not yet called
{\tt isis\_start\_done()}; in this case, only messages that
list the same activity id as the startup task can be delivered, and
others will queue up.
Should messages queue up for this or some other reason, they would be
listed after the activity-id table, like this:
\begin{verbatim}
Messages queues: activity 1 holds mutual exclusion
[act *] MSG 599cc
from (4/0:pr.rcv_reply) to (4/0:10498.new_view) viewid 5.0 size 304
[act *] MSG 5a9c8
from (4/0:10498.0) to (4/0:10496.1)(4/0:10498.receive)
(4/0:10497.1)(4/0:10500.1)(4/0:10499.1) msgid 1093 size 344
[act *] MSG 59318
from (4/0:10497.0) to (4/0:10496.1)(4/0:10498.receive)
(4/0:10497.1)(4/0:10500.1)(4/0:10499.1) msgid 1093 size 344
\end{verbatim}
This would show that the process has not yet called {\tt isis\_start\_done}
and has received three messages that won't be delivered until it does.
The first message installs a new view of some process group (view 5),
presumably a group that the process joined as part of its start sequence.
The next two messages are multicasts that were probably sent to that
group. The activity-id is shown as {\tt \*} for these messages,
meaning that they were sent by processes that had already called
{\tt isis\_start\_done} themselves, and the sender and destinations
are listed for each message. The message-id's will match with the id's
under which the corresponding tasks are waiting for replies (if they
wanted replies!).
In general, the ISIS system
routines call {\tt t\_wait\_l()}\index{\tt t\_wait\_l()} when they
block and include a descriptive message.
These appear in the TASK part of the dump.
It is hoped that this makes the dump reasonably easy to read.
Depending on the style of main program that you adopt, you will often
see a task running {\tt isis\_mainloop} ``waiting forever'',
and a task called the {\tt blocking-listener}.
The former is an artifact of the way that {\tt isis\_mainloop} is
implemented, and the latter is the routine that calls {\tt select}
to block when ISIS has nothing else to do.
If you have tasks that enter ISIS without a {\tt t\_fork} being
done, they may not be described as well as tasks that enter ISIS ``normally''.
If the application uses the token tool, a dump of the tokens
known to the process would appear next.
For each token, the group in which it lives, the token name, the current holder,
a set of flags, and the pending requests (if any) are shown.
ISIS probably should explicitly indicate when a task is blocked waiting for
a token, but it presently does not do so.
A task that wants a token will be shown as waiting for messages.
However, the task will list a session-id that will match up with
one of the pending requests on a token request queue in this case.
Next in the dump we see a printout of the current site-view, plus
statistics for the BYPASS communication channels between this
process and others with which it communicates directly.
Here, we see that site 4 is the only one operational and that
this process has been communicating directly with processes
16476 and 16475.
It has called the UNIX {\tt gettimeofday} system call 34 times (we print
this because the silly call costs a fortune under some UNIX systems).
Over the channel to process 16476, this process sent 13 data-packets and
one ack-packet (other acks were piggybacked) and didn't need to
retransmit any messages (hence 13+0). It received 14 data-packets
and no duplicates, and found 18 messages in those packets (i.e.
the system managed to send 2 or more messages at once in some packets).
There is no IO backlog at present and no messages in the output channel;
if there were, we would see a list of each outgoing packet and its
size, number of times retransmitted, etc.
The {\em message tank} is a holding buffer in which messages may be
delayed briefly between when they are received and when they are
processed, i.e. if the process is busy when a packet arrives.
In this dump, the tank is empty.
The process group views part of the dump starts with the last view
reported to the application program. This is viewid 3.0.
It also contains quite a bit of information about the status of the
BYPASS protocols, which is technical and not normally useful to
application programmers -- but might help us diagnose a problem
if you run into one.
In addition, the address of one other group, called {\tt position\_trader},
is known from a {\tt pg\_lookup} call.
Next, if you enable the
message ``trace''
facility, a list of all currently allocated messages will appear.
If you want to enable a trace (for example, to try and figure out
exactly which messages are tending to pile up), just set the global
variable {\tt msg\_tracemsgs} to 1.
Dumps will now list all the allocated messages.
In fact, you can generate just the part of the dump that lists these,
separately, by calling {\tt msg\_trace\_dump} within your code.
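In code, turning the trace on and printing the list looks roughly like this:
\begin{verbatim}
extern int msg_tracemsgs;        /* ISIS global described above */

void
start_message_trace()
{
    msg_tracemsgs = 1;           /* dumps will now list every allocated message */
    msg_trace_dump();            /* print the current allocated-message list */
}
\end{verbatim}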
The views list members as well as clients for each group, with the client
addresses enclosed in square brackets.
If you run into a more complex dump you may need to contact us
for help interpreting it.
However, the discussion above covers the aspects that are normally
most relevant to the application programmer. For example, the dump
shown above was made because the application in question had hung.
It shows clearly that the problem is a missing reply message and,
when we checked the process that failed to reply, we found that
a bug sometimes caused it to return from its request handling
routine without sending one.
So, this dump pointed pretty directly to a bug in the application
program.
\label{sec:cmd}
The cmd tool\index{{\tt cmd} tool} is an interactive program that provides information
about the sites on which ISIS is currently running and the status
of the ISIS processes and process groups.
\subsection{Running the cmd tool}
\index{{\tt cmd} tool, how to run}
You start the cmd tool by typing:
\begin{verbatim}
cmd [port-number]
\end{verbatim}
You may leave out the port number, in which case {\tt cmd} will try
to look up the port number in the system file.
{\tt Cmd} will then prompt you to enter commands.
The following commands are supported:
\begin{tabbing}
group $<$group-\=name$>$m\=Cause the protocols process to shutd\=own.mm\= \kill
{\bf command} \>\> {\bf description} \> {\bf abbreviation} \\
sites \>\> List sites in the current site view. \>\> si \\
list $<$scope$>$ \>\> List process groups known in a scope. \>\> l \\
group $<$group-name$>$ \>\> Print a process group view. \>\> g \\
dump \>\> Dump the state of the protos process. \>\> d \\
pr\_dump \>\> Dump the protos process to its log file. \>\> pd \\
snapshot \>\> (As root) dumps every process in the \>\> snap \\
\>\> system to log files. \>\> \\
rescan \>\> Cause ISIS to rescan the sites file. \>\> re \\
shutdown \>\> Cause the protocols process to shutdown. \>\> sd \\
send $<$address$>$ $<$arg0$>$ $<$arg1$>$ \ldots\ $<$argn$>$ \\
\> \> Send a message to a set of processes. \>\> s \\
help \>\> Print a list of commands. \>\> h \\
help $<$command$>$ \>\> Print more information about a command. \>\> h \\
quit \>\> Quit the cmd program. \>\> q or \verb@^D@
\end{tabbing}
The commands {\sc sites}\index{{\sc sites} command in {\tt cmd} tool} and {\sc group}\index{{\sc group} command in {\tt cmd} tool}
call the routines {\tt sv\_getview}
and {\tt pg\_getview}, respectively, and print the returned view.
The {\sc list}\index{{\sc list} command in {\tt cmd} tool} command prints the name and address of all groups known within a
specified scope.
A scope is specified as {\tt @name}; {\tt @*} stands for the global scope.
You may use the form {\tt list @sname:pattern} to list only groups with
names that match {\tt pattern}.
{\tt pattern} consists of ordinary characters, `*', `?', and `[\ldots]'
which are interpreted in the same way as by the unix {\tt csh}.
{\sc Dump}\index{{\sc dump} command in {\tt cmd} tool} prints the current status of the local protocols process.
See the section on dumps on how to interpret this information.
{\sc Pr\_dump}\index{{\sc pr\_dump} command in {\tt cmd} tool} causes the dump to be appended to the isis log file
instead of being printed on the screen.
{\sc Rescan}\index{{\sc rescan} command in {\tt cmd} tool} will cause ISIS to rescan the sites file.
It should be used if the sites file is changed while ISIS is running.
{\sc Shutdown}\index{{\sc shutdown} command in {\tt cmd} tool} will kill the local protocols process.
The {\sc send}\index{{\sc send} command in {\tt cmd} tool} command will be explained in the next section.
The cmd tool also accepts a sequence of commands,
separated by commas, on a single input line.
If the cmd tool is invoked as
\begin{verbatim}
cmd [port-number] command arg0 arg1 ...
\end{verbatim}
it will execute the specified command and quit immediately instead
of running interactively.
\subsection{Sending interactive messages}
\index{{\tt cmd} tool, sending interactive messages with}
The cmd tool may also be used as a universal interactive
front-end for controlling and debugging new ISIS applications
being developed.
Say for example, you are building an ISIS application that lives
as a process group with the name ``xgroup''.
For debugging purposes the xgroup supports a ``trace''
option that may be switched on and off dynamically.
Assume there also is a routine {\tt xdump} that dumps some information
to a file, and there are two variables `xx' and `xy' that
are of interest while debugging xgroup.
One could write a separate ISIS program that sends special messages
to xgroup in order to set the trace option, trigger a dump,
or query the value of xx and xy.
Instead, you may use the {\sc send}\index{{\sc send} command in {\tt cmd} tool} command in the cmd tool.
A session might look like this:
\begin{verbatim}
% cmd
cmd> group xgroup
*** gaddr = [cluster 0. site 1. incarn 0 : gid 1]
view = ["xgroup" incarn 1 viewid 2 nmemb 2 nclient 0]
members = [0.1.0:4252, 0.2.0:2061]
cmd> send xgroup set trace
0.1.0:4252 trace is now active
0.2.0:2061 trace is now active
cmd> send xgroup status
0.1.0:4252 xx = 15, xy = 27
0.2.0:2061 xx = 43, xy = -8
cmd> send xgroup dump
0.2.0:2061 dump completed
0.1.0:4252 dump completed
cmd> quit
%
\end{verbatim}
In this session xgroup has two members at sites~1 and~2.
Their addresses are 0.1.0:4252 and 0.2.0:2061
(cluster, site number, site incarnation, and process id).
In order for xgroup to work together with the cmd tool,
the xgroup program needs to provide a handler for the messages sent
by cmd.
Figure~\ref{fig:cmd} shows what
the code in the xgroup program would look like.
\begin{figure}
\begin{verbatim}
isis_entry(MSG_COMMAND, mh_command, "mh_command");
...
void mh_command(msg_p)
    message *msg_p;
{
    int argc;
    char *argv[32];
    char abuf[80];
    getargs(msg_p, argc, argv, 32);
    if (strcmp(argv[0], "set") == 0) {
        if (strcmp(argv[1], "trace") == 0) {
            trace_on = 1;
            reply(msg_p, "%s", "trace is now active\n");
        } else {
            reply(msg_p, "%s", "unknown option\n");
        }
    } else if (strcmp(argv[0], "dump") == 0) {
        xdump();
        reply(msg_p, "%s", "dump completed\n");
    } else if (strcmp(argv[0], "status") == 0) {
        sprintf(abuf, " xx = %d, xy = %d\n", xx, xy);
        reply(msg_p, "%s", abuf);
    } else {
        reply(msg_p, "%s", "unknown command\n");
    }
}
\end{verbatim}
\caption{A typical command message handler}
\label{fig:cmd}
\end{figure}
When the command `{\tt send ...}'
is typed, the cmd tool will pack {\tt ...} into a
message and broadcast the message to the group.
{\tt MSG\_COMMAND} is a predefined entry point number to which it
will send its messages. The xgroup program must call
\begin{verbatim}
isis_entry(MSG_COMMAND, mh_command, "mh_command");
\end{verbatim}
to bind its command message handler to this entry point.
For convenience the macro
\begin{verbatim}
getargs(msg_p, argc, argv, nmax);
\end{verbatim}
is provided for extracting {\tt ...} from the message.
It will store the number of arguments sent in {\tt argc} and a pointer
to each argument in {\tt argv[0]}, \ldots, {\tt argv[argc-1]}.
{\tt Nmax} must contain the maximum number of arguments expected.
The cmd tool expects to receive an answer containing a character string
from each member of the group.
It will print each answer received, preceded by the address of the
replying process.
The {\sc send}\index{{\sc send} command in {\tt cmd} tool} command is not restricted to sending messages to
the whole group.
An address for a {\sc send} command may be specified in one of the
following forms:
\begin{description}
\item[$<$group$>$]
denotes all members of $<$group$>$,
\item[$<$site$>$:$<$group$>$]
denotes all members of $<$group$>$ at $<$site$>$
(specified by site number),
\item[$<$site$>$:$<$group$>$:$<$process$>$]
denotes the member of $<$group$>$ at $<$site$>$ with process id
$<$process$>$.
\end{description}
Furthermore, the group name may be left out or specified as `{\tt *}'
in which case the most recently referenced group
(in a {\sc group}\index{{\sc group} command in {\tt cmd} tool} or {\sc send} command) is used.
The following example illustrates the use of addresses in the
{\sc send} command:
\begin{verbatim}
% cmd
cmd> group xgroup
*** gaddr = [cluster 0. site 1. incarn 0 : gid 1]
view = ["xgroup" incarn 1 viewid 2 nmemb 2 nclient 0]
members = [0.1.0:4252, 0.2.0:2061]
cmd> send * status
0.1.0:4252 xx = 15, xy = 27
0.2.0:2061 xx = 43, xy = -8
cmd> send 2:xgroup status
0.2.0:2061 xx = 43, xy = -8
cmd> send 2:* status
0.2.0:2061 xx = 43, xy = -8
cmd> send 2 status
0.2.0:2061 xx = 43, xy = -8
cmd> send *:2061 status
0.2.0:2061 xx = 43, xy = -8
cmd> send :2061 status
0.2.0:2061 xx = 43, xy = -8
cmd> quit
%
\end{verbatim}
The first {\sc send} command is broadcast to the whole group.
The other commands illustrate several equivalent ways of sending
a command to the group member at site~2.
When using ISIS in control applications, the major issue
involves physical input and output using devices and sensors.\index{files, used within ISIS applications}
This kind of I/O can be done in the normal manner, provided that
two conditions are met.
First, it is important to call the {\tt flush} primitive before doing an output,\index{{\tt flush}, with files or devices}
especially if your application
sends asynchronous {\tt cbcasts} that it treats as having been
delivered to their destinations.
The problem here is that virtual synchrony prevents a computation from
observing an inconsistency by communicating with a process that
should have received a message ``in the past'', but has actually
not yet seen it.
On the other hand, virtual synchrony does not prevent a process from
asynchronously broadcasting a message, then taking some irreversible
action that depends on the message, and then crashing.
If this happens, {\tt cbcast} is not required to deliver the message
unless a {\tt flush} was done prior to the output action.
Notice that {\tt flush} may block briefly while it waits for
any buffered broadcasts to actually be transmitted.
The same considerations apply when doing output to a file that might be
used after recovery from a failure.
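A sketch of this pattern is shown below; the device descriptor, command buffer,
and routine name are assumptions, and only {\tt flush} is an ISIS call.
\begin{verbatim}
void
issue_command(dev_fd, cmd, len)
    int dev_fd, len;
    char *cmd;
{
    /* Make sure any buffered asynchronous cbcasts are on the wire
       before taking the irreversible external action. */
    flush();
    if (write(dev_fd, cmd, len) != len)
        perror("actuator write");
}
\end{verbatim}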
When doing input, the primary issue is to avoid blocking,
since this would prevent ISIS from reading its own messages
and eventually cause your application to either hang or become congested and
crash.
The {\tt isis\_input}, {\tt isis\_input\_sig}, {\tt isis\_signal}, and
{\tt isis\_signal\_sig}
functions provide a simple solution to this problem.
\index{\tt isis\_input} \index{\tt isis\_input\_sig} \index{\tt isis\_signal} \index{\tt isis\_signal\_sig}
Each call to {\tt isis\_input(fdes,routine)} specifies a procedure that
will handle input as it becomes available on file descriptor {\tt fdes}.
ISIS will fork this routine off as a task, so the usual cautions regarding
re-entrancy and blocking should be respected.
The routine
{\tt isis\_input\_sig(fdes,\&cond,routine)},
will do a {\tt t\_sig(\&cond,routine)} if input becomes available on {\tt fdes}.
The system will maintain only a single input routine per
file descriptor.
An input monitor can be canceled by specifying a null procedure (or condition)
pointer.
The two {\tt isis\_signal} functions have the same interface, and
are used to catch UNIX signals.
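The fragment below sketches the typical usage for console input; the handler
name and what it does with each line are assumptions.
\begin{verbatim}
#include <stdio.h>

/* Forked off as a task by ISIS whenever descriptor 0 (standard input)
   has data available.  It should return promptly and must respect the
   cautions above regarding blocking and re-entrancy. */
void
console_input()
{
    char line[128];

    if (fgets(line, sizeof(line), stdin) == NULL)
        return;                        /* end of file */
    /* ... parse the line, perhaps multicasting a request to a group ... */
}

/* During initialization, before entering isis_mainloop: */
void
setup_console()
{
    isis_input(0, console_input);
}
\end{verbatim}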
\label{Ch:demos}
\section{Twenty Questions}
This \index{twenty questions demo}
demo program consists of two parts: the {\em twenty
questions server}, which maintains a database of
answers to a very restricted set of questions, and the {\em questions-answers} front-end,
which poses queries to the service and interprets the results.
The program is fully described in the ISIS paper that appeared
in the 11th Symposium on Operating Systems Principles.
To run it, copy the file ``questions.dat'' from the demos source directory to the place
where ISIS is running (normally, /usr/spool/isis).
Make sure that the pathname used in the rexec request in
twenty.c will work on your system (normally, /usr/games/twenty).
Start twenty questions up in that directory, for example:
\begin{verbatim}
kama% twenty
\end{verbatim}
(you can specify a port number if the default in /etc/services
is not set up correctly on your system).
The program will start up a total of 5 additional processes,
picking the sites at which to run them by stepping through the list of operational sites.
It is then operational.
If any subset of the twenty questions processes is killed,
new ones will be restarted by the survivors.
(See below for help shutting the program down).
Twenty questions maintains a database that has
columns titled guess (the category of object the service is
thinking of, which is what you are trying to guess)
and color, price, size, etc., which are categories
against which one may pose queries.
These are of the form {\tt category op value},
where the category must come from the list, the operator
is one of $<$, $>$, and $=$ (no double-character operators!),
and the value is a number or a text string.
Sample queries are: {\tt price>15000000},
{\tt color=red}, {\tt guess=car}.
To query the database, run the qa program: {\tt qa} (or
{\tt qa port-number}).
{\tt qa} will print the categories available and
ask you to pick a random number, which is used to select
the class of object against which your queries will be
solved.
Each query can be typed in one of three ways: as a simple
query, in which case one of the servers will answer
(and you can always expect to get {\em exactly one}
answer),
with an asterisk in front (e.g. {\tt *price > 15000000})
in which case all active servers will answer (one is
a hot standby and won't respond, so expect 5 answers),
or with an {\tt @} in front, as in {\tt @price>15000000},
which works like an asterisk but repeats the query 50 times
consecutively, about twice per second if the system is
running on a SUN 3/50.
In the case of a simple query, the answer is computed
over the entire column that applies.
The answer will be one of {\tt yes, no} or {\tt sometimes},
meaning {\em always yes}, {\em always no}, or
{\em sometimes yes and sometimes no}.
The case of a query prefixed by {\tt *} or {\tt @} is
more complicated; here, each of the 5 servers is
responsible for handling some rows of the database, and
the answers each returns relate to the selected category
for its rows only.
For example, if the random number chosen is 1 (planes),
and the query is {\tt *price>1500000},
the answer will be printed as
\begin{verbatim}
sometimes no no no sometimes
\end{verbatim}
Twenty questions and qa demonstrate several aspects of
ISIS.
First, they are illustrative of the simple development
strategy used when building ISIS software -- both programs
are pretty trivial.
Second, they illustrate the sense in which all
members of a process group see the same events in the
same order.
Finally, they illustrate fault-tolerance.
You may want to kill some members of the twenty
questions process group while an {\tt @} query is
running.
You will either see no effect at all, or at worst will see
something like this:
\begin{verbatim}
sometimes no no no sometimes
sometimes no no no sometimes
sometimes no no no sometimes
sometimes no no no sometimes
twenty questions service is restarting... please be patient
... retrying
sometimes no no no sometimes
sometimes no no no sometimes
\end{verbatim}
Briefly, while the service was unavailable, {\tt qa} just
had to sit and wait.
By the way, {\tt qa} is what we call a {\em client}
of the twenty questions process group.
This gives somewhat faster query times than if the
program were to talk to {\tt twenty} without
being installed as a client.
You can remove the {\tt pg\_addclient} call in
{\tt twenty} to see the difference in performance.
Adding a client takes a little time, and this is what
causes the delay when {\tt qa} first starts up.
Without the {\tt pg\_addclient}, ISIS runs queries
slower, but the first query will be quite a bit
faster.
To shut the demo down, run {\tt qa} and type {\tt bye}.
{\bf *Note:} When running the twenty questions demo on some Sun systems
{\tt twenty} may fail with the following error:
\begin{verbatim}
ISIS: unable to exec : setgid(-2)Invalid argument
\end{verbatim}
See {\em The remote exec facility}, Section~\ref{Note:rexec}.
The workarounds include:
\begin{enumerate}
\item Run rexec as something other than ``root'', e.g. some special ``isis''
userid. This means that jobs it starts up will run under that userid.
\item ``Fix'' the /etc/passwd entry for ``nobody'', giving user and group ids
that are reasonable, e.g. 999.
\item ``Fix'' your copy of the twenty questions demo to supply a legitimate userid
and password.
\end{enumerate}
\section{The bank program}
This is a toy bank application that illustrates the ISIS transaction mechanism.
\index{transactions, bank demo}
\index{bank demo}
The {\tt bank} program maintains a simple database of bank account balances,
and the {\tt teller} program queries and updates that database. You can run any
number of copies of the teller, but only one copy of the bank program
should be active. The teller has a {\tt help} command which explains how to use
it.
To get more of a feel for the transaction mechanism, try killing the bank
and restarting it. The bank will read its log and re-execute all
transactions that committed. In some cases the bank may not know the
outcome of a transaction that was committing or aborting at the moment it was
killed. In such cases it will consult the transaction manager (the {\tt xmgr}
process started up when ISIS is initialized) to find out about these
transactions. When the bank is restarted, any still-running tellers will
notice that the old bank has died, giving the error ``unable to contact bank
service''. When the next command is entered the teller will connect to the
new bank process and continue.
The bank demo uses most of the features of the transaction mechanism.
Looking at the source of the bank and teller programs will give you some
idea of what is needed in an ISIS transaction program. Briefly, the
application itself, or the database system it uses, must store data in
non-volatile or stable storage, manage some prepare/commit/abort ``bit'' for
the updates each transaction makes to that data, initiate recovery of
transactions after a crash, and provide concurrency control.
The bank demo is a toy in most respects. It is {\em not} a model of how to write
a high-performance, or large scale transaction application. The bank uses
an intentions list on secondary storage to record changes to the database
which is stored in primary memory. This means that upon crash and restart
the bank re-reads and executes all the updates it has ever done. Any real
application would have to checkpoint the database to secondary storage
periodically, or more likely, store the entire database on secondary
storage and cache part of it in primary memory.
Further, intentions lists are almost always slower than in-place updates
with write-ahead logs. With intentions lists, a transaction records the
updates it wishes to make in the log, and the updates are made to the real
data only when the transaction commits, causing a flurry of disk activity
during transaction commit. With write-ahead logs the updates are made to
the actual data during the transaction and the log is used to record
before-images of the records as they are modified. Upon transaction abort
the in-place updates are undone using the information in the log.
The bank uses a simple form of optimistic, non-serializable concurrency
control. Two balances are maintained for each bank account. The true
balance, and the {\em cleared} balance. During a transaction which withdraws \$X
from an account the cleared balance will be \$X less than the true balance.
Most of the time the balances are identical and the {\tt inquire} command will
only display the cleared balance if it differs from the true balance. A
real application would require more in the way of concurrency control. The
token tool (described in Chapter 11) would be a good choice
for this.
In the future we expect to write a more realistic bank demo, incorporating
multiple bank servers. Such a demo would demonstrate how the transaction
mechanism coordinates multi-site commit.
As this is the first release of the ISIS transaction mechanism we would
welcome comments from anyone who uses it. We are particularly interested in
people's experience with interfacing our transaction mechanism with a
pre-existing database system, especially one which you cannot modify to
suit ISIS. One of our goals is to make such inter-working successful.
\section{Parallel make program}
In this section we describe the
parallel make \index{parallel make demo} demo, which is actually quite a neat
piece of software.
The program was developed by Doug Voigt, of Hewlett Packard, Inc.
\subsection*{Using Parallel Make}
Parallel make (pmake) is an ISIS application in which several processors
cooperate to execute a job specified in the language of the standard UNIX make
utility. This introduction to parallel make assumes some familiarity with
standard make, and with the execution environment of ISIS applications.
A pmake job is modeled as a directed acyclic graph (DAG) in which nodes
represent steps in the job, and edges represent data dependencies between
steps. Each explicit rule in pmake has the form
\begin{verbatim}
{target}* : {dependency}*
{ shell command }*
\end{verbatim}
The commands specified in a rule are executed sequentially as one step in the
pmake job. The same is true for each instantiation of an implicit rule.
For example, consider the following make file text:
\begin{verbatim}
product: file1 file2
file1.o : file1.c file1.h
        cc -c file1.c
file1 : file1.o
        rm file1
        cc -o file1 file1.o
file2 : file2.c file1.o
        rm file2
        cc -o file2 file2.c file1.o
\end{verbatim}
product, file1.o, file1 and file2 appear as targets, file1.c, file2.c file1.h
and file1.o appear as dependencies, and cc and rm shell commands are specified
in the rules. Since the rm and cc commands under file1 are specified in the
same rule they are recognized as a single step. Parallel make will always
execute them in sequence. The same is true under file2.
Like make, parallel make does not require that all dependency and target names
be files. Symbolic dependencies ( such as ``product'' in the example ) may
appear which do not represent specific files. It is VERY important that ALL
dependencies be carefully specified in the make file(s). Unlike serial make,
parallel make will feel free to execute whole steps in any order allowed by the
dependencies.
In the example, standard make would build file1 followed by file2 because that
is the order in which steps are listed in the ``product'' rule. Parallel make
might build file1.o, followed by file2, and end with file1 as that is allowed
by the dependencies specified.
\subsection*{Parallel vs. Serial Specifications}
Parallel make executes steps in 2 phases, a serial phase followed by a
parallel phase. During the serial phase pmake's activity is exactly like
standard make's activity, with one exception. When pmake is about to execute
a step in the make file it first checks to see whether the user specified that
it be executed during the parallel phase. If so, a description of the step
and its dependencies is enqueued in a graph file for later execution.
Otherwise the step is considered to be a serial step, and is executed
immediately. Thus pmake's capabilities are a superset of make's capabilities.
The user specifies a parallel rule by preceding the first command in the rule
with a {\tt |} character. Any rule without a leading | is executed during the serial
phase of pmake. Rules with leading | characters are accumulated in a graph
file and executed after all of the serial steps in the entire job are complete.
This means that even though specifications of serial and parallel steps may
be interleaved in the make file, all of the serial steps are executed before
any of the parallel ones. All of the parallel steps are executed by one
invocation of pmkexec, the parallel execution manager.
Pmake ensures correct ordering of steps executed during the parallel phase using
dependencies accumulated during the serial phase. The default implicit
rules do not begin with | characters. Users should either avoid or modify
implicit rules.
In the following example, the rm commands are separated from the cc commands
using symbolic dependencies. The rm commands are not preceded by | symbols, so
they are executed during serial phase as they are encountered. The cc commands
are accumulated and placed in a graph file for later parallel execution.
\begin{verbatim}
product: file1 file2
file1.o : file1.c file1.h
        |cc -c file1.c
rmfile1:
        rm file1
rmfile2:
        rm file2
file1 : file1.o rmfile1
        |cc -o file1 file1.o
file2 : file2.c file1.o rmfile2
        |cc -o file2 file2.c file1.o
\end{verbatim}
The distinction between serial and parallel steps must be carefully considered
when devising a parallel make specification. For example, in instances where
one pmake includes another as one of its steps, a completely independent
invocation occurs. If the embedded pmake command used default options, the
subordinate would execute its own parallel phase before returning.
It is also possible to accumulate job steps from multiple make files for
execution during one parallel phase. Since pmake generates the graph file in
append mode, steps accumulate until the graph file is purged. Normally pmake
purges its graph file after executing the parallel phase. The -A option can
be used to suppress execution of the parallel phase and the purging of the
graph file. Subordinate pmakes which use the -A option accumulate job steps
for execution during parallel phase, then just leave them in the graph file.
This allows other pmake instances to continue accumulating steps in the same
graph file for execution during ITS parallel phase. In this case the data
dependencies identified by various subordinates must be sufficient to force
correct execution order.
In the following example three make files, one of which invokes the other two,
accomplish the same result as the earlier examples.
dir/makefile contains the following:
\begin{verbatim}
product: file1 file2
rmfiles:
        rm file1
        rm file2
        rm product.gph
file1 : rmfiles
        cd subdir1; pmake -G ../product.gph -A
file2 : rmfiles
        cd subdir2; pmake -G ../product.gph -A
\end{verbatim}
dir/subdir1/makefile contains this:
\begin{verbatim}
file1:
        |cc -o file1 file1.o
file1.o : file1.c file1.h
        |cc -c file1.c
\end{verbatim}
and dir/subdir2/makefile contains this:
\begin{verbatim}
file2:
        |cc -o file2 file2.c ../subdir1/file1.o
\end{verbatim}
This pmake is invoked with the command:
\begin{verbatim}
pmake -G product.gph
\end{verbatim}
The use of the -G and -A options in the pmake command cause output from the two
submakes to be appended to the same graph file, product.gph. The rm commands,
including one to remove the old product.gph, are specified serially, so they
execute during the serial pass instead of being queued on product.gph. Because
the submakes are accumulated in the same graph file, the file and symbolic
names contained in them are resolved together, allowing them to reference each
other. Parallel make can interleave the execution of commands from the
submakes constrained only by the specified dependencies.
All dependency and target names are evaluated in UNIX style, even if they are
symbolic. Thus 2 names which yield the same path ( taking current directories
into account where appropriate ) will be treated as the same object by parallel
make even if they arose from different sub-makes. In the example parallel make
realizes that the reference to ../subdir1/file1.o refers to the same object as
file1.o referenced from the other submake. Parallel make does not account for
synonyms such as those which result from the ln command, so be sure to refer to
such objects in a consistent manner.
Be careful of any conditionals such as ``if'' or ``for'' statements used in serial
rules which depend on data created or massaged by parallel rules. Keep in mind
that we are pulling off a major con job on serial make by deferring execution
of its steps.
\subsection*{Parallel Make Flow}
The serial phase is executed directly by the command {\tt pmake} on make files
which are compatible with standard make with the exceptions mentioned above.
During the serial phase of parallel make, steps to be executed in the parallel
phase are accumulated in a graph file. The user may specify the name of the
graph file using the -G option in the pmake command. To cause parallel
execution of the commands stored in the graph pmake automatically executes the
command:
\begin{verbatim}
pmkexec product.gph pmk 0 6
\end{verbatim}
where:
\begin{itemize}
\item product.gph is the graph file (pmake.gph is the default)
\item pmk is the name of the ISIS group to be formed for execution
\item 0 is the ISIS port number associated with the site view to be used
\item 6 is the maximum number of workstations to be used at any one point
during the job.
\end{itemize}
Pmkexec reads the graph file and passes it to the prescheduler which
evaluates and allocates steps to servers in a hopefully reasonable manner.
After that the ISIS process group which will execute the job is formed. In the
event that execution does not progress as the prescheduler thought it would
steps are dynamically reallocated ( fought over ) by group members.
The identity of the servers assigned to the job is specified in the current
ISIS site view. Other than the server from which the job was begun,
parallel make uses only servers which have no users as process group members.
The user must be sure ISIS is running on the servers before running pmake.
Pmake uses r\_exec to start server processes on remote sites.
The standard output from steps executed by each server is accumulated in a
separate results file. These are concatenated when the job is completed to
present a complete but artificially ordered history of the job in files in the
directory that was current when pmkexec was run. Pmake prints job output to
stdout upon completion of the job. Remember that pmake is not robust in the
ISIS sense, but pmkexec is.
\subsection*{How to Run a Parallel Make Demo}
\begin{enumerate}
\item Load and make the ISIS system including the demos directory.
\item cd to ~isis/SUN/demos/pmk, the object directory for parallel make.
\item If they don't already exist, copy qtest*.c and makeqtest from
~isis/demos/pmk.
\item If qtest exists, type the command:
{\tt make -f makeqtest clean}
to remove the old version
\item Be sure ISIS is started on the server you are using. This demo uses up
to 5 additional servers if ISIS is running on them and they have no
other users.
\item type the command:
{\tt ../../bin/pmake -f makeqtest -P $<$port$>$}
where $<$port$>$ is the ISIS port number.
This causes the following output when run with only 1 available server:
\end{enumerate}
\begin{verbatim}
isis% ../../bin/pmake -f makeqtest -P 1551
SERIAL PHASE
|cc -c qtest1.c
|cc -c qtest2.c
|cc -c qtest3.c
|cc -c qtest4.c
|cc -c qtest5.c
|cc -o qtest qtest1.o qtest2.o qtest3.o qtest4.o qtest5.o
|qtest
PARALLEL PHASE
RUNNING /usr/u/isis/SUN/bin/pmkexec pmake.gph pmk 1551 6
RESULTS:
**********************************************************************
STEP 1 : cd /usr/fsys/odin/c/b/isis/SUN/demos/pmk; cc -c qtest2.c;
**********************************************************************
STEP 2 : cd /usr/fsys/odin/c/b/isis/SUN/demos/pmk; cc -c qtest3.c;
**********************************************************************
STEP 3 : cd /usr/fsys/odin/c/b/isis/SUN/demos/pmk; cc -c qtest4.c;
**********************************************************************
STEP 0 : cd /usr/fsys/odin/c/b/isis/SUN/demos/pmk; cc -c qtest1.c;
**********************************************************************
STEP 4 : cd /usr/fsys/odin/c/b/isis/SUN/demos/pmk; cc -c qtest5.c;
**********************************************************************
STEP 5 : cd /usr/fsys/odin/c/b/isis/SUN/demos/pmk;
cc -o qtest qtest1.o qtest2.o qtest3.o qtest4.o qtest5.o;
**********************************************************************
STEP 6 : cd /usr/fsys/odin/c/b/isis/SUN/demos/pmk; qtest;
1 2 3 4 5 5 4 3 2 1
\end{verbatim}
The order of the printed results changes depending on the number of servers.
The command associated with each step number should always be the same as that
shown here.
\subsection*{How to Run Parallel Make With Your Own Makefiles}
\begin{itemize}
\item Locate the absolute files pmake and pmkexec and make pmake accessible to
the user and/or directory(s) where you will be running pmake. Both of
these files must be in the same directory, i.e. ~isis/SUN/bin.
\item Modify or generate your makefile to indicate which steps should be done
in parallel. Remember:
- pmake requires that dependencies be separated from commands by a
$<$cr$>$ $<$tab$>$ sequence. A ``;'' confuses it.
- The precious option does nothing
- Parallel steps begin with a ``{\tt |}''
\item Remove the old graph file if it exists.
\item type the pmake command incorporating the following options in addition to
those supported by standard make:
\begin{description}
\item[-G] graph file name. Default: pmake.gph.
\item[-P] ISIS port number. Default: 0.
\item[-S] number of servers. Default: 6.
A negative port number generates lots of messages indicating
pmkexec's progress.
\item[-K] keep the report files generated by pmake described below.
Default: Destroy files at end of job.
\item[-A] accumulate parallel steps in the graph file without executing them.
Default: execute at end of pmake.
\end{description}
\item Pmake will print a report including the commands executed and their
standard output when execution is complete.
\item If the -K option was requested, the following files will persist:
\begin{itemize}
\item the graph file containing a description of the parallel steps executed
in a pmake specific format described below.
\item a copy of the graph file annotated with step execution and server
queue and speed information. The name of this file is the name of the
graph file with a ``r'' appended to it.
\item a map of the way the job was actually executed indicating which steps
were executed by which servers in what order. Step numbering is
generated by pmake and can be converted into command contents using
the graph file. The name of this file is the name of the graph file
with ``r.sum'' appended to it.
\item a file for each server containing the standard output it generated.
The names of these files are the name of the graph file with ``.stdoX''
appended to it, where X is the server number. Servers are numbered
beginning at 0 by pmake.
\end{itemize}
\item You can rerun exactly the same sequence of parallel steps just completed
using the command:
pmkexec file group port servers
where:
\begin{itemize}
\item file = graph file name
\item group = ISIS process group name
\item port = ISIS port
\item server = number of servers
\end{itemize}
You can also add servers to a job manually by logging into the desired
server and typing the command
pmkexec group port servers
These are the commands used by pmake and rexec to execute the parallel
part of a parallel make job.
\item When making pmake, use the C preprocessor variable ABSDIR in the cc
commands for main.c and pmkexec.c to indicate the directory in which
pmkexec will reside. The value of ABSDIR must be a string ( including
double quotes ) containing the complete path name to the directory.
\end{itemize}
\subsection*{A Larger Example}
The pmakefiles necessary to make ISIS are included in the ~isis/demos/pmk
account. In order to pmake ISIS these files must be placed in the ISIS object
directory hierarchy rooted at some directory, say, isis\_home as follows:
\begin{verbatim}
pmk directory file copy to:
pmake.isis isis_home/makefile
pmake.protos isis_home/protos/makefile
pmake.clib isis_home/clib/makefile
pmake.mlib isis_home/mlib/makefile
pmake.util isis_home/util/makefile
pmake.demos isis_home/demos/makefile
\end{verbatim}
Pmake can then be run on the ISIS code from isis\_home.
\subsection*{Internal Graph File Format}
Pmkexec receives descriptions of parallel steps in a textual ``graph file'' in
which every line contains a 1 to 6 character keyword beginning in its first
column. The keyword determines the meaning of the rest of the line.
All communication between pmake and pmkexec, as well as the initial state
specifications for new server processes use this format. In addition, one or
all servers can be forced to dump their states into a file of this format
using the send command in the ISIS cmd utility. A single parameter indicating
the name of the file in which the dump should be placed is required. Each
server will append its pmake server number to the file name specified to
yield the name of the dump file to be generated. Note that the consistency of
these dumps depends on the type of broadcast used by the send command.
Data items and process steps are defined by lines or sequences of lines in
this file. Data items and steps are numbered for reference by pmkexec in the
order in which they appear in the graph file. The following line formats may
appear:
\begin{description}
\item[NEW]
Indicates the beginning of the graph
\item[SIZE] steps items servers
Indicates the number of job steps, data items, and server data
structures to be allocated for the job description. This information
is not essential but its presence allows faster processing of the graph
file.
\item[IN] name mod size type
Indicates an input dependency. - Also used to declare a data item for
future reference as an input or output dependency.
\begin{description}
\item[name] the name of the file
\item[mod] - indicates whether or not the file is out of date or has been
modified ( ``U'' for unmodified, ``M'' for modified ),
\item[size] - size of the file in bytes
\item[type] - numeric file type indicator - NOT USED
\end{description}
\item[OUT] name mod size type
Indicates an output file generated during execution of a step. A given
data item may only be generated by one step. If an OUT line appears
outside of a step definition, subsequent IN lines ( up to the next
OUT or STEP line) designate input dependencies to be applied to the
step which generates the OUT file. This allows dependencies to be
specified before the command for generating a file is known.
Parameters same as those of IN.
\item[STEP] content
Begins definition of a step for execution. Also declares that the step
is in a particular processors queue if it lies between a PROC line
and an END line.
content - The bourne shell command which constitutes the step.
\item[TIME] milliseconds
Indicates the expected duration of the step whose definition is in
progress.
milliseconds - the expected duration in milliseconds.
\item[CALCS] ctime dtime mod status dep
A set of calculations generated by pmkexec.
\begin{description}
\item[ctime] the length of the critical path from this step to the end of
the job
\item[dtime] The total compute time of steps which depend on this one.
\item[mod] ``U'' for no recomputation necessary, ``M'' for recompute.
\item[status] The completion status of the step:
\begin{enumerate}
\item RAW - CALCS not yet computed
\item EVAL - CALCS computed, but step not allocated to processor
\item ALLOCATED - Step allocated to server but dependencies not
fulfilled
\item SCHEDULED - Step allocated and ready to execute,
or executing
\item SCHED\_LOCK - Step allocated and ready but being transferred
or about to change status
\item DONE - Step execution complete.
\end{enumerate}
\item[dep] The number of input dependencies which have not yet been fulfilled
\end{description}
\item[RSLT] start stop exp\_stop deny
Statistics resulting from executing or scheduling a step. All times
in milliseconds.
\begin{description}
\item[start] the time at which execution was started
\item[stop] the time at which execution ended
\item[exp\_stop] The time at which execution is expected to end based on
simulated execution - NOT USED
\item[deny] The time at which a request for the step was last denied.
\end{description}
These times vary from one server to another because their clocks are not
synchronized in any way. The one time on which all servers should agree
is (stop - start), as the messages actually exchanged between servers
contain this quantity as reported by the executing server.
\item[ENV] env\_string
A UNIX environment variable assignment to be installed before executing
a step.
env\_string - Specification of a variable and its value in the form:
variable=value
\item[IN\#] item
Specifies that a previously encountered data item is an input to a step
item - The internally generated number of the data item
\item[OUT\#] item
A previously encountered data item is output by a step.
item - The number of the data item.
\item[STDO] file
NOT USED
\item[SPEED] speed\_ratio
Indicates the measured speed of the execution of a step, defined to be
the actual execution time of the step divided by the expected execution
time. Also used to indicate the average of the speeds of all steps
executed by a given processor.
speed\_ratio - ((stop - start)/TIME)
\item[DUMDUR] milliseconds
Indicates that execution of a step is to be simulated using a unix sleep
command of (milliseconds/1000) seconds instead of actually executing the
command associated with the step.
\item[END]
Indicates that the definition of a step is complete
\item[PROC] name
Indicates the beginning of the definition of a server. Definition ends
when another PROC command is encountered. Any steps defined after a
PROC are in that server's queue.
name - The ascii name of the server
\item[RANK] num
Indicates the current rank of a server in an ISIS process group view
num - the rank, assigned by ISIS.
\item[STEP\#] num
Indicates that a previously encountered step is in a server's queue
num - the number of the step
\item[DONE]
Indicates that all preceding steps in a processor's queue have been
completely executed, and that any subsequent steps have not.
\item[DUMSPD] speed\_ratio
Indicates the speed at which any SIMULATED steps on this processor
should be executed.
speed\_ratio - factor by which to multiply a step's DUMDUR to obtain
the length of the sleep command
\end{description}
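To make these keyword descriptions concrete, a small, purely illustrative
fragment of a graph file follows; the command, file names, sizes, and the
assumption that data items and steps are numbered from zero are all
hypothetical.
\begin{verbatim}
NEW
SIZE 1 2 1
IN main.c U 2048 0
OUT main.o U 0 0
STEP cc -c main.c
TIME 5000
IN# 0
OUT# 1
END
\end{verbatim}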
\subsection*{PMAKE Source Files}
\begin{description}
\item[Makefile] the standard unix make file for making pmake and pmkexec
\item[check.c] a module of the public domain make utility not modified for pmake
\item[input.c] unmodified public domain make module
\item[macro.c] unmodified public domain make module
\item[pmain.c] the pmake main program - modified public domain module
\item[make.c] the pmake dependency interpreter - modified public domain module
\item[make.h] include for pmake - modified public domain module
\item[makeqtest] pmake file for quick test
\item[pmake.*] pmake files for building ISIS
\item[pmk.doc] this document
\item[pmkdat.h] data structures for pmkexec
\item[pmkexec.c] parallel execution module---encapsulates all ISIS interaction
\item[pmkgph.c] DAG data structure manipulation
\item[pmkio.c] graph file input and output
\item[pmklib.c] routines converting make command lines and their dependencies into
graph file entries. Performs execution time estimates.
\item[pmklst.c] linked list manipulation
\item[pmksched6.c] sixth generation scheduling decision maker
\item[qtest*.c] quick pmake demo/test
\item[reader.c] unmodified public domain make module
\item[rules.c] unmodified public domain make module
\end{description}
\section{Parallel and replicated execution}
Process groups provide a natural way to distribute a computational
task over a set of processes.
We saw one way to do this in the chapter on replicated data.
Using the Linda S/Net tuple space, one could
build an application in which the members of a process
group compete for pieces of work to do, with the items of
work being represented by tuples in the shared tuple memory.
The Linda style of distributed computing is a good (and recommended)
approach for problems in which a set of identical servers
are trying to exploit parallelism to speed up the processing of
some computational task.
However, as noted below, the Linda style uses several {\tt bcast}
operations. For some tasks, the number of Linda tuple
operations needed, and hence the {\tt bcast} cost, becomes high enough so that
other approaches are needed.
For other tasks, the Linda style simply doesn't match very well with the
computational goal.
In this section we describe a variety of reasons one might wish to subdivide a
computation, giving ISIS solutions for each case.
The first part of this chapter focuses on ways of
dividing a computation up among the members of a process group.
We describe three approaches: one in which all members of a process
group redundantly undertake the same computational task,
one that appoints a single member as the {\em coordinator} to oversee the
computation, backed up by {\em cohorts} that take over to
complete the computation if the coordinator fails, and a
third approach in which the work is actually divided up
among a set of processes, each of which becomes responsible for some
part of it.
The approaches are presented roughly in order of complexity: a redundant computation
is easy to construct but makes inefficient use of CPU resources.
A coordinator-cohort computation eliminates this inefficiency but in a way that
fails to exploit the potential parallelism of a process group unless the
group is receiving large numbers of requests concurrently.
A subdivided computation, on the other hand, is tricky to code in a fault-tolerant
way. The Linda approach can often be used instead of a subdivided
computation in cases where the computation will
run for a ``reasonably long time'' relative to the time needed to
do a {\tt bcast}. The reason for phrasing the choice in terms of this
cost comparison is that the Linda style of computing requires more
{\tt bcast} operations than the subdivision scheme illustrated in this chapter,
and hence might perform poorly if the computation is so quick that
the cost of an extra {\tt bcast} begins to impact the
time needed to complete it.
The second part of this chapter looks at how replicated data can be
managed in the context of this sort of computation.
We focus on cases that are non-transactional.
Readers who are using transactional facilities
external to ISIS, or need to obtain transactions within ISIS, should
refer to the chapter on transactions.
The last part of the chapter addresses the problem of
aborting a computation while it is underway.
\section{Redundant computation}
The simplest kind of distributed execution
\index{redundant computation}\index{distributed computation, redundant}
arises when one wishes to have several
processes undertake the same task (although perhaps on
different data), with the caller waiting for the
first response and continuing to execute without waiting for
responses from the others.
For example, a process in need of some resource might ask
a set of servers managing identical databases if any can provide it,
taking the first answer.
Because it wastes CPU cycles, it might seem as if the
only reason for adopting this approach is its fault-tolerance.
However, the same style of interaction would make sense in
the case of a database partitioned among several servers:
the servers could be asked to search for some object in parallel,
with the caller continuing as soon as any server locates the object.
Here, the servers are really doing different parts of the work,
thus exploiting the parallelism inherent in a process group.
Another reason for using a replicated update arises
when a set of servers are asked to update
copies of some replicated data item.
Since all will
arrive at the same state after doing the update, the caller who needs some
response from the server based on the outcome of the update need only
wait until it knows the result of one of these updates before continuing.
It is easy to implement a redundant computation using ISIS.
The request should be broadcast (using {\tt bcast}) to the process group
that will execute it, requesting 1 reply ({\tt nwant = 1}).
Recipients undertake their computational tasks, and then
reply. A participant whose computation,
say a search, terminates unsuccessfully uses
{\tt nullreply} to indicate this negative outcome.
\begin{figure}
\makebox[457pt][l]{
\hspace{-186pt}
\vbox to 286pt{
\vfill
\special{psfile=kb90-006.ps}
}
}
\center{Caller does bcast, resumes when first reply arrives}
\vspace{1in}
\caption{A redundant computation.}
\end{figure}
The caller will remain blocked until the first reply arrives, at which
point it can continue computation (see Figure 8.1).
The following example illustrates a redundant computation.
\begin{verbatim}
#include "isis.h"

address *rmgr;      /* Address of the resource manager */

/* Ask a resource manager to look up a resource. */
address get_raddr(name, passwd)
char *name, *passwd;
{
        address where;

        if(bcast(rmgr, SEARCH, "name=%s,passwd=%s", name, passwd,
                 1, "addr=%a", &where) != 1)
                ... error case ...
        return(where);
}
\end{verbatim}
For its part, the manager would execute the following code:
\begin{verbatim}
/*
 * Resource manager: search a list of "items"
 * for one that matches the search request
 */
search_req(mp)
message *mp;
{
        char *name, *passwd;
        int n;

        msg_get(mp, "name=%s,passwd=%s", &name, &passwd);
        for(n = 0; n < NITEMS; n++)
                if(match(&services[n], name, passwd))
                {
                        reply(mp, "addr=%A[1]", &services[n].it_addr);
                        return;
                }
        nullreply(mp);
}
\end{verbatim}
\section{Coordinator-cohort computation}
The coordinator-cohort scheme is popular within ISIS
\index{coordinator-cohort computation}\index{distributed computation, coordinator-cohort}
because it provides inexpensive
fault-tolerance for simple computations that can be performed by all
members of a process group.
The method works by ranking the members of a process group according to
a simple rule and then
labeling the lowest ranking member as the {\em coordinator} for a request:
it will execute the request and {\tt cbcast} a reply to the caller.
The other members are {\em cohorts}: they are passive unless a failure
prevents the coordinator from terminating normally, in which case they
take over one by one, in rank order.
The coordinator is also able to send the cohorts
a copy of the answer returned to the caller
(and possibly additional information) at the termination of the computation.
For example, the coordinator can compute a set of updates to data structures
and then inform the cohorts of these updates as part of its reply message,
or in a setting where the outcome of each request is logged, it could
just send the reply for the purpose of log entry.
\subsection*{The mechanics of a coordinator-cohort computation}
A coordinator-cohort computation must be started up, in parallel, by all
members of the process group that will run it.
This has some subtle implications to which we return below.
The easiest case is the one where a broadcast message arrives, because
all the members will receive it concurrently.
In this case, they should all execute the following subroutine call:
\begin{verbatim}
coord_cohort(msg, gaddr, action, got_reply, arg)
message *msg;
address *gaddr;
int (*action)(), (*got_reply)();
char *arg;
\end{verbatim}
The arguments are the message that triggered the action,
the address of the process group doing the
computation, a
subroutine that will take the desired action on behalf of the group, and
a subroutine that will be called in the cohorts when the action finishes.
A coordinator-cohort computation can also be run in other situations, say if
all the members of a process group observe some other event such as a failure.
The critical word here is {\em all}: the {\tt coord\_cohort} tool assumes that
all members of {\tt gaddr} will call it, and that the calls will all be done in
the same {\em group view} of the group to which {\tt gaddr} corresponds.
The message pointer should be null in cases where the computation is being
started for some reason other than reception of a request message.
The tool operates by running a loop.
On each iteration, it picks a coordinator and invokes its
action routine as: {\tt action(msg,gaddr,how,arg)}.
The arguments are the message pointer, which could be null if the tool was
invoked with a null message pointer,
a pointer to the group-address from the {\tt coord\_cohort} call, and
an indication of whether this action is running in the original
coordinator or a cohort that took over because the original
coordinator failed.
The corresponding values of {\tt how} are {\tt ORIGINAL} and {\tt TAKEOVER},
respectively.
If the coordinator fails before terminating (as described below),
the loop iterates, picking another coordinator and invoking its
action routine, etc.
The coordinator takes the desired action, and then
terminates by doing one of two things.
\begin{enumerate}
\item[1.]
If the operation was triggered by reception of a {\tt bcast} message,
the coordinator just replies to it.
If the caller is waiting for responses at all, it
would normally specify that it needs one reply.
The reply message will be transmitted using {\tt cbcast} to
the other members of the process group, where their {\tt got\_reply}
routines are invoked with the reply message as an argument.
Thus, either everyone gets the reply or nobody does.
{\em Note:} if all members of the group fail, or some software bug causes the
coordinator to crash at one member after another,
the caller will eventually wake up with {\tt 0} replies.
\item[2.]
A caller can initiate an asynchronous coordinator-cohort computation,
and in this case it would be inappropriate to terminate the
computation by sending a {\tt reply}.
In addition, a computation could start
for some reason other than reception of a message.
In these cases, the coordinator should call the routine
\begin{verbatim}
cc_terminate(fmt, args...)
char *fmt;
\end{verbatim}
{\tt cc\_terminate} functions by generating a message and sending it to the
{\tt got\_reply} routines in the cohorts.
\end{enumerate}
One special case arises when the coordinator wants to send some message
to an additional destination {\em as part} of its termination protocol.
This is done using the interface
\begin{verbatim}
cc_terminate_l(addr_p, entry, fmt, args...)
char *fmt;
\end{verbatim}
\index{{\tt cc\_terminate}}
\index{{\tt cc\_terminate\_l}}
The value of this routine is that the message and the termination
are sent atomically, ensuring that the message will be sent exactly once.
Were the application to first send a {\tt bcast} to the address
and then do the {\tt cc\_terminate}, the cohorts couldn't tell if
coordinator failure had occurred before
sending the message or after doing so, and would have to risk sending
it twice or explicitly inquire about this.
The most obvious use of {\tt cc\_terminate\_l} is thus in a situation
where the entire role of the coordinator is to send some message to
a third-party.
The address may be a group or a process, but if it is a group, the
coordinator must be a member of that group.
No restriction applies if the address specifies a single process.
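As a small sketch of termination via {\tt cc\_terminate} (the names
{\tt do\_update} and {\tt compute\_new\_value} are purely illustrative),
an action routine that was not triggered by a request message could finish
like this, so that its result still reaches the cohorts' {\tt got\_reply}
routines:
\begin{verbatim}
/* Illustrative coordinator action routine, terminating via cc_terminate */
do_update(msg, gaddr, how, arg)
message *msg;        /* null here: no request message triggered the computation */
address *gaddr;
int how;             /* ORIGINAL, or TAKEOVER if a cohort took over */
char *arg;
{
        int new_value;

        new_value = compute_new_value();      /* hypothetical application work */
        cc_terminate("value=%d", new_value);  /* delivered to got_reply routines */
}
\end{verbatim}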
\subsection*{How the coordinator is selected}
The coordinator
\index{coordinator-cohort computation, coordinator selection rule}
for a given request is picked according to the following rule.
If the process group contains a member at the same site where the
message originated, the lowest ranking such member will be the first
coordinator.
Otherwise, the coordinator will be the member with rank
{\tt sender\_site} {\em mod\/} {\tt pg\_nmembers}.
The second part of the rule is intended
to distribute the responsibility around the group when
requests come from random sites.
Cohorts take over from failed coordinators in increasing rank order, cycling through the
list of members if repeated failures occur.
\subsection*{Processes that join while the computation is running}
\index{coordinator-cohort computation, need to inhibit joins during}
ISIS does not add extra cohorts to a coordinator-cohort algorithm
when it is already running.
This may or may not pose a problem for the ISIS programmer.
If the coordinator-cohort scheme is used just for fault-tolerance,
it should not matter if the new group members are not included; the
computation will just be a bit less fault-tolerant than might
otherwise have been the case.
More serious is the case where a computation changes the state
of the process group, updating replicated data or otherwise
doing something that the new member might need to know about.
If the updates are transmitted using group broadcasts, the new member will
receive them, and there should be no problem.
On the other hand, it is tempting to use the reply message to simultaneously
carry the reply to the caller and the result of the coordinator's
computation to the other group members -- especially because the
reply is sent atomically, hence either all receive this broadcast or none
do so.
Here, the new member would not receive this reply because it is not
a cohort in the original computation (and even if it did, it might not know
what to make of it, since it will not have received the original request!).
To address this problem, ISIS provides a mechanism for inhibiting new
members from joining a group.
To inhibit joins, the routine {\tt pg\_join\_inhibit(1)} should be
called.
\index{\tt pg\_join\_inhibit}
Joins will be delayed until {\tt pg\_join\_inhibit(0)} is called.
Join inhibits, if used, should be done in parallel by all members of a
process group -- for example, right before starting a coordinator-cohort
computation.
Since no new member will manage to join while the computation is underway,
the reply will now be sent to all members of the group, which will be
some subset of the original cohort set.
The inhibit routine counts the number of times it is called and
can thus be called multiple times by multiple, concurrent coordinator-cohort
computations.
No mechanism is provided for ensuring fairness (namely, that joins will
not be starved forever).
As a convenience, the inhibit mechanism will be activated automatically
by ISIS for certain entry points, if you ask it to.
\index{\tt isis\_inhibit\_joins}
To do this, call:
\begin{verbatim}
isis_inhibit_joins(entry-no);
\end{verbatim}
during startup. ISIS will then inhibit joins at all times
when a task is active processing a message to this entry point.
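If done by hand instead, a minimal sketch might look as follows (assuming
that {\tt coord\_cohort} does not return to its caller until the computation
has terminated; the entry routine and the action and reply routine names are
illustrative):
\begin{verbatim}
address *gaddr;      /* group address; assume initialized in pg_join */

update_req(mp)
message *mp;
{
        int do_update(), saw_update();

        pg_join_inhibit(1);     /* delay joins while the computation runs */
        coord_cohort(mp, gaddr, do_update, saw_update, (char *)0);
        pg_join_inhibit(0);     /* permit joins again */
}
\end{verbatim}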
\subsection*{An example of the coordinator-cohort approach}
The following example illustrates a coordinator-cohort approach to a
DFFT computation.
Such a computation arises in signal processing:
given a vector of data pointers representing a signal sampled at regular time intervals
and a list of frequencies at which to compute spectral estimates, a DFFT
computation will compute and return the power of the signal at each of those
frequencies.
The computation can be costly if high accuracy is required, hence it would
make sense to run it on fast processors or processors with special
vector multiply hardware.
We first give code for the client process:
\begin{verbatim}
/*
 * Ask a distributed service to calculate the power of
 * a signal at a set of frequencies.
 */
address *svc;        /* Address of service */

dfft(signal, freqs, power)
double signal[10240], freqs[56], power[56];
{
        if(bcast(svc, DFFT, "signal=%F,freqs=%F",
                 signal, 10240, freqs, 56, 1, "power=%F", &power) <= 0)
                return(ERROR);
        return(SUCCESS);
}
\end{verbatim}
The code isn't much more complicated on the server side:
\begin{verbatim}
address *svc;        /* Address of this service */

/*
 * On receiving the actual request,
 * just invoke the coordinator-cohort tool
 */
dfft_req(mp)
message *mp;
{
        int do_dfft(), ignore_reply();

        coord_cohort(mp, svc, do_dfft, ignore_reply);
}

/* The coordinator computes the actual DFFT solution */
do_dfft(mp)
message *mp;
{
        double *signal, *freqs, power[MAX_FREQS];
        int ns, nf, a;

        msg_get(mp, "signal=%-F,freqs=%-F", &signal, &ns, &freqs, &nf);
        for(a = 0; a < nf; a++)
                power[a] = compute_dfft(signal, ns, freqs[a]);
        reply(mp, "power=%F", power, a);
}

/*
 * Although the cohorts aren't interested in the outcome,
 * this routine is needed anyhow
 */
ignore_reply(mp)
message *mp;
{
}
\end{verbatim}
\subsection*{Coordinator-cohort advantages}
How would one know whether or not to use a coordinator-cohort solution to
\index{coordinator-cohort computation, deciding when to use}
a problem?
The ISIS coordinator-cohort scheme is interesting because it
provides two particularly powerful guarantees:
\begin{description}
\item[load balancing] If several requests arrive concurrently, the
job of being coordinator will be split over the members of the group in
a uniform manner.
Thus, a single process may be coordinator for one or two requests while
being cohort for several others.
Moreover, there may be several coordinators at one time, for different
requests.
The scheme thus exploits the distributed
processing power of the group in a fault-tolerant way.
\item[atomicity] If a coordinator fails before replying, exactly one
cohort will take over, always.
If the coordinator replies before failing, and the reply is delivered,
no cohort will take over -- the situation will be the same as if the
request had finished long before the failure.
Thus, a coordinator-cohort scheme gives a form of {\em exactly once}
execution, unless all cohorts fail.
If all cohorts do fail, the caller will see this -- but will be
unable to determine, of course, how far the computation had progressed
before the failure.
\end{description}
Thus, the essential issue relates to whether these properties can be
useful in a given setting.
Notice that the code is just as simple as for a redundant computation.
On the other hand, a coordinator-cohort
computation only exploits the potential parallelism of the process group when
several requests happen to arrive simultaneously from random sites.
\section{Subdivided computations}
Both schemes described above treated all processes as equivalent:
the caller requires just a single reply, which
can come from any of the participant processes.
A more stringent requirement applies in situations where work
is being divided among a set of processes, each of
which will contribute to the final result.
For example, one might sort a large file by requesting that
each of $N$ processes sort chunks containing $1/N${\em 'th} of the
file, and then merging the sorted subsequences.
In a more mathematical example, one might ask each of a set of
processes to compute a spectral moment from a signal for some
specified frequency (a costly computation), combining the results to
obtain a frequency template characterizing the power of the
signal at this set of frequencies.
In both cases, the distributed computation is used to exploit the CPU
power of the servers as a set.
\subsection*{The basic approach}
\index{subdivided computation, basic approaches}\index{distributed computation}
The basic approach, not surprisingly, is similar to the previous ones.
The client sends out the request using {\tt bcast} and then waits for
replies -- {\tt ALL} replies in this case, unless it knows the
actual value of {\em N}.
The various recipients then divide up the work using some deterministic
rule (we will return to this issue momentarily).
\subsection*{Making the approach fault-tolerant}
The scheme as described above is not
fault-tolerant.
Since the caller needs {\em all} the replies to assemble a single
complete answer,
if one participant should fail the remaining {\em N-1} sub-solutions
may not constitute a single complete solution to the problem.
To make matters worse, ISIS encourages its users to program
applications in a manner that is oblivious to the actual number
of processes that make up a service, in order to facilitate dynamic
reconfiguration.
Thus, a typical caller of a ``sorting service''
who asks for {\tt ALL} replies will need to be able to determine that it
received $N$, the number the service set out to compute, and no fewer.
Fortunately, this problem is easy to solve.
When a user invokes a broadcast primitive in ISIS, the system
starts by deciding what processes to transmit the message to.
The number of destinations will be placed in the global variable
{\tt isis\_nsent}.
The number of replies received is returned by the broadcast primitive,
and for symmetry is also stored in the global variable {\tt isis\_nreplies}.
Thus, in replicated executions it is probably
an error if ${\tt isis\_nsent} < {\tt nwant}$ or if ${\tt isis\_nreplies} < {\tt isis\_nsent}$.
The application can check for these cases and react, for example by
reissuing the request or by treating this as an exception.
Alternatively, the service can send back the value of {\em N} as part of
the reply.
Before we consider the problem of dividing work among several processes, one
more observation about failure handling in this case should be made.
Observe that it may not always be safe to request re-execution of just
the part of the answer that is missing, even assuming that there is an easy
way to figure out which part of the answer was lost in a failure.
In general, if the failure will not have affected the {\em manner} in
which the answer was computed, the missing piece can be recomputed
independent of the remainder of the computation.
Recall, however, that ISIS supports a notion of the
{\em distributed configuration}
of a system, which can change dynamically in response to failures,
recoveries, and load balancing operations, or even replicated updates.
In situations where the computation may have been ``configuration sensitive'',
there will generally be no choice but to reissue the entire request.
\subsection*{Subdividing the work}
There are many ways to divide the work among a set of process group
\index{subdivided computation, use of configuration in}
members.
The easiest to code are the ones that use some simple deterministic
algorithm based on the current process group view (see {\tt pg\_getview})
and perhaps the address of the sender of the request message (see {\tt
msg\_getsender}).
Much fancier schemes can, however, be devised.
These typically use a configuration data structure which all
processes examine when deciding how to process a given request.
Such structures were introduced in Chapter 9.
Provided that all group members see the same contents when
they look at the structure, the decisions they arrive at will be mutually
consistent.
In the examples below, there is no configuration structure other
than the process group view itself.
\subsection*{An example}
Let us return to the DFFT example, but develop a subdivision strategy under which
each member of the process group does some of the work.
\index{subdivided computation, example}
The idea is that we will use a deterministic rule for dividing the
set of frequencies up, so that each member handles some of the frequencies.
An easy rule, and the one we use, is to base this on the rank of a process
in the process group (which will be a number in the range {\em 0..nmembers-1}) and
the current number of members.
Of course, the client may not know how many members were running when the
request was received, so we will also have the group members
send back an indication of which frequencies they handled.
To avoid making the example unnecessarily complex, the
solution simply iterates a caller's request if an
insufficient number of replies is received.
\begin{verbatim}
/*
 * Ask a distributed service to calculate the power of a
 * signal at a set of frequencies. Assume that each member
 * of the service will handle some subset of the frequencies.
 * Assume there are MAX_SVCSIZE or fewer members.
 */
address *svc;        /* Address of service */

dfft(signal, freqs, power, nf)
double *signal, *freqs, *power;
int nf;
{
        int *which[MAX_SVCSIZE];
        double *answ[MAX_SVCSIZE];
        register who, i;

        /* Loop if a crash happens just as we make the request */
        svc = pg_lookup("dfft-service");
        do
        {
                if(bcast(svc, DFFT,
                         "signal=%F,freqs=%F", signal, 10240, freqs, nf,
                         ALL, "which=%-D,power=%-F", &which, &answ) < 0)
                        return(ERROR);
        }
        while(isis_nreplies != isis_nsent);
        /* Now copy the answers. The idea is that each member of the
         * server computed a subset of the answers. They told us which
         * ones by returning a list of indices into the frequencies vector,
         * terminated by -1. The answers are pointed to by answ[].
         */
        for(who = 0; who < isis_nreplies; who++)
        {
                /* which[who][] tells which frequency each answer is for */
                for(i = 0; which[who][i] >= 0; i++)
                        power[which[who][i]] = answ[who][i];
        }
        return(SUCCESS);
}
\end{verbatim}
Notice that the complexity on the client side comes mostly from the need
to reassemble the answers, since the order of answers in a reply is
unpredictable.
In some situations, it may be easier to sort the answers into some
reasonable order just to avoid this problem.
At any rate, the code for the server looks like this, using the
work subdivision scheme described above:
\begin{verbatim}
/*
 * Member of the computational service: compute some DFFT powers.
 * The members each respond on the basis of their rank in the
 * process group. The i'th member handles frequencies i, i+NMEMBERS,
 * i+2*NMEMBERS, and so on.
 * Assume no server will compute more than MAX_FREQS answers.
 */
address *dfft_svc;   /* Assume initialized in pg_join */

dfft_req(mp)
message *mp;
{
        double *signal, *freqs, answ[MAX_FREQS];
        int ns, nf, which[MAX_FREQS+1];
        register i, n, a = 0;

        msg_get(mp, "signal=%-F,freqs=%-F", &signal, &ns, &freqs, &nf);
        n = pg_nmembers(dfft_svc);
        for(i = pg_rank(&my_address, dfft_svc); i < nf; i += n)
        {
                which[a] = i;
                answ[a++] = compute_dfft(signal, ns, freqs[i]);
        }
        which[a] = -1;      /* terminate the index list, as the caller expects */
        reply(mp, "which=%D,power=%F", which, a+1, answ, a);
}
\end{verbatim}
Before turning to other issues, there is one remaining problem that
this solution raises.
Recall that ISIS has a limit on how much stack space any task can use.
Thus, this code could overflow the ISIS stack limit if
{\tt MAX\_FREQS}
were large enough so that {\tt sizeof(answ)+sizeof(which)}
becomes anywhere near that stack limit size, currently 32k.
Since this is not an unlikely event, we should discuss ways to avoid the
problem.
Three alternatives come to mind:
\begin{enumerate}
\item Use {\tt malloc} and {\tt free} to dynamically allocate storage for the
{\tt answ} and {\tt which} arrays.
This is the easiest solution (a sketch appears just after this list).
\item Assuming that {\tt compute\_dfft} never causes the task to block,
(which would permit other tasks, and perhaps another invocation of {\tt dfft\_req}, to
run),
{\tt answ[]} and {\tt which[]} can be declared as static variables.
On the other hand, if {\tt compute\_dfft} could block, it might get re-entered in which case
a bug would result: the two invocations would put their answers in
the same memory locations.
It seems unlikely that a routine to compute dfft spectral estimates would block,
so this solution is probably satisfactory.
\item Still using static variables, one can serialize any requests that happen to show up
concurrently, which would work even if the computation can block.
For example:
\begin{verbatim}
dfft_req(mp)
message *mp;
{
        static dfft_inprogress;     /* Set when computing a dfft */
        static condition dfft_wait; /* To wait when computation is running */

        /* Wait if the computation is already running */
        if(dfft_inprogress++ != 0)
                (void)t_wait(&dfft_wait);
        . . . as before . . .
        /* Let one task through, if any is waiting */
        if(--dfft_inprogress != 0)
                t_sig(&dfft_wait, 0);
}
\end{verbatim}
\end{enumerate}
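The following is a minimal sketch of the first alternative: the two arrays
are allocated on the heap rather than on the task stack. It simply repeats
the earlier server code with {\tt malloc} and {\tt free} added.
\begin{verbatim}
dfft_req(mp)
message *mp;
{
        double *signal, *freqs, *answ;
        int ns, nf, *which;
        register i, n, a = 0;

        msg_get(mp, "signal=%-F,freqs=%-F", &signal, &ns, &freqs, &nf);
        /* Allocate the arrays dynamically instead of on the task stack */
        which = (int *)malloc((nf + 1) * sizeof(int));
        answ = (double *)malloc(nf * sizeof(double));
        n = pg_nmembers(dfft_svc);
        for(i = pg_rank(&my_address, dfft_svc); i < nf; i += n)
        {
                which[a] = i;
                answ[a++] = compute_dfft(signal, ns, freqs[i]);
        }
        which[a] = -1;
        reply(mp, "which=%D,power=%F", which, a+1, answ, a);
        free((char *)which);
        free((char *)answ);
}
\end{verbatim}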
\section{Updating replicated data in distributed executions}
Replicated data is easy to access and update during a distributed
\index{distributed computation, use of replicated data in}
execution, provided that care is taken when synchronizing with other
computations and obtaining locks.
This section covers some simple, stylized schemes that should be
easy to adapt.
All give comparable performance; hence the scheme to choose is the one
that seems most natural in the context of the distributed system you
are designing.
\subsection*{Memory resident data}
The approach adopted depends on the style of distributed execution
being used.
{\bf Parallel and redundant computations.} When using a scheme in which all participants execute in parallel,
the best strategy is for each participant to do updates locally,
\index{redundant computation, use of replicated data in}
directly on the local copy of the distributed data item.
If locking is necessary, all participants should obtain locks locally
using a deterministic algorithm, then all will block waiting for
a lock if any does, and when the lock is released, all will resume
execution.
The simplest way to do this is only applicable in a totally
redundant computation, because it takes advantage of the fact that all
process group members in such a setting do exactly the same things
in the same order.
Here, one can just have each member use an integer
variable to designate whether or not the lock is held, and a condition
variable on which tasks that need to wait will queue up:
\begin{verbatim}
acquire_lock(lockvar, waitqueue)
int *lockvar;
condition *waitqueue;
{
        while(*lockvar)
                (void)t_wait(waitqueue);
        (*lockvar)++;
}

release_lock(lockvar, waitqueue)
int *lockvar;
condition *waitqueue;
{
        (*lockvar)--;
        t_sig(waitqueue, 0);
}
\end{verbatim}
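As an illustration of how these routines might be used in a fully redundant
computation (the data item {\tt counter} and the entry routine {\tt add\_req}
are hypothetical), every member applies the same update under the purely
local lock:
\begin{verbatim}
int counter;                 /* local copy of the replicated data item */
int counter_lock;            /* lock variable, 0 when the lock is free */
condition counter_queue;     /* tasks wait here while the lock is held */

/* Entry routine invoked (via bcast) at every member in parallel */
add_req(mp)
message *mp;
{
        int delta;

        msg_get(mp, "delta=%d", &delta);
        acquire_lock(&counter_lock, &counter_queue);
        counter += delta;    /* every member performs the identical update */
        release_lock(&counter_lock, &counter_queue);
        reply(mp, "value=%d", counter);
}
\end{verbatim}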
Chapter 11, which looks at synchronization issues,
gives several ways to obtain mutual exclusion or locking in
less redundant computations.
In such settings the above code obviously won't work, because it
involves purely local synchronization that happens to work
globally because the overall computation happens to be
identical at all process group members.
{\bf Coordinator-cohort computations.}
\index{coordinator-cohort computation, use of replicated data in}
Using a coordinator-cohort scheme, replicated data is a bit
harder to manage because the coordinator is supposed to run
the entire computation, leaving the cohorts passive unless it
fails.
The easy case is the one where no locking or synchronization is
needed.
Here, the only requirement is to ensure that all the updates done
by the coordinator occur atomically, even if it fails.
For example, if the coordinator terminates by doing a single update,
or a small set of updates that can be described in a single message,
the solution is just to send copies of this information to the cohorts,
who can apply the updates locally when {\tt got\_reply} is called at the end of
the computation.
It would normally be necessary to inhibit joins when using this approach.
There are two ways to pass this update information: either to add extra
information to the reply beyond the part the caller will look at,
or to use a separate {\em message field} to store the information the
cohorts will need.
The latter scheme would be useful if the caller is not supposed to
know about the extra information being passed to the cohorts.
This case generalizes to cover situations in which the coordinator
generates a stream of updates while it executes, but still needs no
locking or special synchronization.
The best strategy is to send these updates to the cohorts
using ``lazy'' broadcasts (broadcast option {\tt `z'}), which are transmitted to the cohorts
in a way that minimizes system load.
The cohorts have a choice: they can spool the updates, and apply them
all at once when the computation terminates.
Or, they can apply the updates as they arrive, but keep a ``log''
recording old values of data items, rolling back if a coordinator failure is
observed by replaying the log.
The easiest case of all is unfortunately the least common: some
applications are {\em idempotent}, meaning that the application remains
correct even if its operations are performed more than once.
In this case the cohorts are free to repeat the coordinator's actions
even if some updates done by the coordinator are already recorded
in the data structures managed by the application.
In this case, one can apply updates as they arrive, but dispense with the log
and rollback mechanism.
What if synchronization {\em is} needed?
We describe mechanisms for synchronization in Chapter 11.
Using these, one can obtain mutual exclusion on behalf of the coordinator,
and then continue using any of the methods outlined above,
releasing the mutual exclusion or lock at the end of the
computation.
\section{Disk files}
There is no restriction on using disk files for temporary purposes within
an ISIS client program.
However, problems can arise if the file is intended to be
used after a recovery.
This latter topic is covered in the chapter on transactional issues.
\section{Aborting a distributed execution before it terminates}
In some applications, it is necessary to be able to
abort requests while a distributed execution is being undertaken to
respond to them.
\index{distributed computation, aborting while in progress}
Although a bit complicated, such an effect can be achieved using a polling
scheme under which the cancel message sets a flag that the
computational task will notice next time it looks.
Our scheme uses the event-id that ISIS associates with each
broadcast.
\begin{verbatim}
/* Accept a message to cancel the operation with the given event-id */
event_id cancel_id;

cancel_req(mp)
message *mp;
{
        msg_get(mp, "%e", &cancel_id);
}

/* Process a request, checking periodically lest it be canceled */
process_req(mp)
message *mp;
{
        /* Takes advantage of global variable my_eid */
        while(not-yet-done && cmp_eid(&cancel_id, &my_eid) != 0)
        {
                .... compute for a while, with a blocking
                     system call somewhere ....
        }
}
\end{verbatim}
Polling is only slightly more awkward in a task that would run a long time
{\em without blocking}, for example while searching a large file.
Normally, such a task won't see ISIS messages at all,
because {\tt isis\_mainloop} never gets a chance to run.
To circumvent this, one polls by checking for the arrival of incoming ISIS messages
and reading them as they arrive.
\begin{verbatim}
/* Accept a message to cancel the operation with the given event-id */
event_id cancel_id;

cancel_req(mp)
message *mp;
{
        msg_get(mp, "%e", &cancel_id);
}

/* Process a request, checking periodically lest it be canceled */
compute_req(mp)
message *mp;
{
        /* Takes advantage of global variable my_eid */
        while(not-yet-done && cmp_eid(&cancel_id, &my_eid) != 0)
        {
                .... compute for a while ....
                isis_accept_events(flag);
        }
}
\end{verbatim}
The {\tt isis\_accept\_events(flag)} routine will read and deliver any
\index{{\tt isis\_accept\_events}, used for polling}
pending messages or trigger any pending signal handlers
(see {\tt isis\_signal}), then return to the caller
after the resulting tasks have had a chance to run.
The flag tells ISIS whether to block (specify {\tt ISIS\_BLOCK}) or
run asynchronously (specify {\tt ISIS\_ASYNC}).
A third alternative uses a select-style timeout structure: here you
specify {\tt ISIS\_TIMEOUT} followed by an additional argument
giving the timeout interval as a pointer to a {\tt struct timeval}
(see SELECT(3)). \index{\tt ISIS\_ASYNC}\index{\tt ISIS\_TIMEOUT}
\index{\tt ISIS\_BLOCK}
Callers of this routine explicitly allow other tasks to execute
before it returns; this has an advantage if the goal is to
support a way of canceling a pending operation,
but has the disadvantage of potentially requiring inter-task synchronization code
to avoid race conditions or re-entrancy bugs.
The routine will return right away if there is no work to do and the flag is
0. It will wait for some work to do if the flag is {\tt ISIS\_BLOCK}.
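For example (a sketch only; the routine name and the 50 millisecond figure
are arbitrary), a long-running task might let ISIS process pending work for
a bounded period like this:
\begin{verbatim}
#include <sys/time.h>

/* Let ISIS deliver pending messages, waiting at most 50 ms for new work */
poll_isis_briefly()
{
        struct timeval delay;

        delay.tv_sec = 0;
        delay.tv_usec = 50000;
        isis_accept_events(ISIS_TIMEOUT, &delay);
}
\end{verbatim}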
{\em See also: } In the section on lightweight tasks in ISIS, we
discuss the task scheduling mechanism in more detail. This
includes pointers on how to modify existing programs with
a non-task structure to run most effectively under ISIS.
\documentstyle[11pt,twoside]{report}
\makeindex
\title{The ISIS System Manual, Version 2.1}
\author{K. Birman, R. Cooper, T. Joseph, K. Marzullo, M. Makpangou \\
K. Kane, F. Schmuck and M. Wood \\ \\
V1.3.1 \copyright 1989 by The ISIS Project \\
V2.0 \copyright 1990 by The ISIS Project \footnote
{The ISIS system was developed with support from the Department of Defense
Advanced Research Projects Agency, DARPA, under ARPA order 5378, contract MDA903-85-C-0124 and
ARPA order 6037, contract N00140-87-C-8904, and DARPA/NASA contract NAG2-593.
The information contained in this manual is unclassified.
}
}
% Redefine @ as the subscript char, and _ as a plain underscore?
%\renewcommand{\tt}{
%\catcode`_=\active \newcommand{_}{\makebox[1.4mm][l]{\_}}
\newcommand{\newtextwidth}[3]{
\setlength{\textwidth}{#1}
\setlength{\columnwidth}{#1}
\setlength{\hsize}{#1}
\setlength{\linewidth}{#1}
\setlength{\oddsidemargin}{#2}
\setlength{\evensidemargin}{#3}
}
\newcommand{\hrf}{\ \hrulefill\ }
\newcommand{\Marginpar}[1]{\marginpar[\raggedright #1]{\raggedleft #1}}
\begin{document}
\maketitle
\pagenumbering{roman}
\cleardoublepage
\tableofcontents
\cleardoublepage
\chapter*{Preface}
\input{preface}
\cleardoublepage
\setcounter{page}{1}
\pagenumbering{arabic}
\pagestyle{myheadings}
\chapter{Getting Started}
\markboth
{\hrf \rm Getting Started\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Getting Started\hrf}
\input{start}
\cleardoublepage
\chapter{ISIS in the large}
\markboth
{\hrf \rm ISIS in the Large\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm ISIS in the Large\hrf}
\input{arch}
\cleardoublepage
\chapter{The Major Components of the ISIS System}
\markboth
{\hrf \rm Major Components\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Major Components\hrf}
\input{major}
\cleardoublepage
\chapter{Basic Facilities}
\markboth
{\hrf \rm Basic Facilities\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Basic Facilities\hrf}
\input{basic}
\cleardoublepage
\chapter{More About Messages}
\markboth
{\hrf \rm More About Messages\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm More About Messages\hrf}
\input{messages}
\cleardoublepage
\chapter{The Lightweight Task Subsystem}
\markboth
{\hrf \rm The Lightweight Task Subsystem\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm The Lightweight Task Subsystem\hrf}
\input{tasks}
\cleardoublepage
\chapter{The Broadcast Interface}
\markboth
{\hrf \rm The Broadcast Interface\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm The Broadcast Interface\hrf}
\input{bcinterface}
\cleardoublepage
\chapter{Virtual Synchrony}
\markboth
{\hrf \rm Virtual Synchrony\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Virtual Synchrony\hrf}
\input{vsync}
\cleardoublepage
\chapter{Replicated Data}
\markboth
{\hrf \rm Replicated Data\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Replicated Data\hrf}
\input{replicate}
\cleardoublepage
\chapter{Distributed and Parallel Executions}
\markboth
{\hrf \rm Distributed and Parallel Executions\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Distributed and Parallel Executions\hrf}
\input{distexec}
\cleardoublepage
\chapter{Synchronization Facilities}
\markboth
{\hrf \rm Synchronization Facilities\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Synchronization Facilities\hrf}
\input{synch}
\cleardoublepage
\chapter{Transactions}
\markboth
{\hrf \rm Transactions\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Transactions\hrf}
\input{trans}
\cleardoublepage
\chapter{Bypass communication}
\markboth
{\hrf \rm Bypass communication\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Bypass communication\hrf}
\input{bypass}
\cleardoublepage
\chapter{Logging, Spooling, and Long-Haul Facilities}
\markboth
{\hrf \rm Logging, Spooling, Long-Haul Facilities\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Logging, Spooling, Long-Haul Facilities\hrf}
\input{logtool}
\cleardoublepage
\chapter{Broadcast Types and Order}
\markboth
{\hrf \rm Broadcast Types and Order\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Broadcast Types and Order\hrf}
\input{types}
\cleardoublepage
\chapter{Advanced Facilities}
\markboth
{\hrf \rm Advanced Facilities\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Advanced Facilities\hrf}
\section{Building large process groups}
\input{large}
\section{The remote exec facility}
\input{rexec}
\section{Signals}
\input {sigs}
\section{Forking off a child from within ISIS}
\input{fork}
\section{Simulating multiple ISIS sites on one machine}
\input{multiple}
\section{Interacting with files and devices}
\input{control}
\section{Dealing with STREAM I/O connections}
\input{streams}
\section{Using ISIS in an X-windows program}
\input{xwindows}
\section{Using ISIS in a suntools program}
\input{suntools}
\section{News facility}
\input{news}
\section{The recovery manager}
\input{rmgr}
\section{Cmd---the interactive ISIS control program}
\input{cmd}
\section{Creating and interpreting client dumps}
\input {cdump}
\section{Creating and interpreting protocol process dumps}
\input {pdump}
\section{Load monitoring utility}
\input {prstat}
\section{How the system behaves under heavy load}
\input {oload}
\appendix
\cleardoublepage
\chapter{Setting Up ISIS}
\markboth
{\hrf \rm Setting Up ISIS\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm Setting Up ISIS\hrf}
\input{setup}
\cleardoublepage
\chapter{Quick Reference}
\newtextwidth{6in}{0.25in}{0.25in}
\markboth
{\hrf \rm Quick Reference\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm Quick Reference\hrf}
\input{quick}
\cleardoublepage
\chapter{Performance of the Toolkit Facilities}
\markboth
{\hrf \rm Performance of the Toolkit Facilities\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm Performance of the Toolkit Facilities\hrf}
\label{Ch:perf}
Until recently, performance was a moving target in ISIS.
ISIS V1.3.1 was quite slow, but ISIS V2.0 seems about as fast as any
published system that uses a multicast mechanism.
This version of ISIS is at least 5-10 times faster than V1.3 was.
A detailed paper on performance and scale issues in ISIS is under
preparation;
we will modify this appendix to include the results from that study
in a future release of the manual and system.
However, Cornell feels that performance is no longer a major problem
in ISIS.
\cleardoublepage
\chapter{Twenty questions and other demo software}
\markboth
{\hrf \rm Twenty questions\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm Twenty questions\hrf}
\input{demos}
\cleardoublepage
\chapter{Calling ISIS from UNIX Fortran (F77) programs}
\markboth
{\hrf \rm FORTRAN\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm FORTRAN\hrf}
\input{fortran}
\cleardoublepage
\chapter{ISIS and non-ISIS task packages}
\markboth
{\hrf \rm NON-ISIS TASKS\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm NON-ISIS TASKS\hrf}
\input{nstasks}
\cleardoublepage
\chapter{Using ISIS from ALLEGRO LISP programs}
\markboth
{\hrf \rm LISP\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm LISP\hrf}
\input{lisp}
\chapter{Using the META subsystem}
\markboth
{\hrf \rm META\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm META\hrf}
\input{meta}
\cleardoublepage
\markboth
{\hrf \bf Index}
{\bf Index\hrf}
\input{index}
\end{document}
% EPSF.TEX macro file:
% Written by Tomas Rokicki of Radical Eye Software, 29 Mar 1989.
% Revised by Don Knuth, 3 Jan 1990.
% Revised by Tomas Rokicki to accept bounding boxes with no
% space after the colon, 18 Jul 1990.
%
% TeX macros to include an Encapsulated PostScript graphic.
% Works by finding the bounding box comment,
% calculating the correct scale values, and inserting a vbox
% of the appropriate size at the current position in the TeX document.
%
% To use, simply say
% \input epsf % somewhere early on in your TeX file
% \epsfbox{filename.ps} % where you want to insert a vbox for a figure
%
% The effect will be to typeset the figure as a TeX box, at the
% point of your \epsfbox command. By default, the graphic will have its
% `natural' width (namely the width of its bounding box, as described
% in filename.ps). The TeX box will have depth zero.
%
% You can enlarge or reduce the figure by saying
% \epsfxsize=<dimen> \epsfbox{filename.ps}
% instead. Then the width of the TeX box will be \epsfxsize, and its
% height will be scaled proportionately.
% (The \epsfbox macro resets \epsfxsize to zero after each use.)
%
% If you want TeX to report the size of the figure (as a message
% on your terminal when it processes each figure), say `\epsfverbosetrue'.
%
\newread\epsffilein % file to \read
\newif\ifepsffileok % continue looking for the bounding box?
\newif\ifepsfbbfound % success?
\newif\ifepsfverbose % report what you're making?
\newdimen\epsfxsize % horizontal size after scaling
\newdimen\epsfysize % vertical size after scaling
\newdimen\epsftsize % horizontal size before scaling
\newdimen\epsfrsize % vertical size before scaling
\newdimen\epsftmp % register for arithmetic manipulation
\newdimen\pspoints % conversion factor
\pspoints=1truebp % Adobe points are `big'
\epsfxsize=0pt % Default value, means `use natural size'
%
\def\epsfbox#1{%
%
% The first thing we need to do is to open the
% PostScript file, if possible.
%
\openin\epsffilein=#1
\ifeof\epsffilein\errmessage{I couldn't open #1, will ignore it}\else
%
% Okay, we got it. Now we'll scan lines until we find one that doesn't
% start with %. We're looking for the bounding box comment.
%
{\epsffileoktrue \chardef\other=12
\def\do##1{\catcode`##1=\other}\dospecials \catcode`\ =10
\loop
\read\epsffilein to \epsffileline
\ifeof\epsffilein\epsffileokfalse\else
%
% We check to see if the first character is a % sign;
% if not, we stop reading (unless the line was entirely blank);
% if so, we look further and stop only if the line begins with
% `%%BoundingBox:'.
%
\expandafter\epsfaux\epsffileline:. \\%
\fi
\ifepsffileok\repeat
\ifepsfbbfound\else
\ifepsfverbose\message{No bounding box comment in #1; using defaults}\fi
\global\def\epsfllx{72}%
\global\def\epsflly{72}%
\global\def\epsfurx{540}%
\global\def\epsfury{720}\fi
}\closein\epsffilein
%
% Now we have to calculate the scale and offset values to use.
% First we compute the natural sizes.
%
\epsfrsize=\epsfury\pspoints
\advance\epsfrsize by-\epsflly\pspoints
\epsftsize=\epsfurx\pspoints
\advance\epsftsize by-\epsfllx\pspoints
%
% If `epsfxsize' is 0, we default to the natural size of the picture.
% Otherwise we scale the graph to be \epsfxsize wide.
%
\ifnum\epsfxsize=0 \epsfxsize=\epsftsize \epsfysize=\epsfrsize
%
% We have a sticky problem here: TeX doesn't do floating point arithmetic!
% Our goal is to compute y = rx/t. The following loop does this reasonably
% fast, with an error of at most about 16 sp (about 1/4000 pt).
%
\else\epsftmp=\epsfrsize \divide\epsftmp\epsftsize
\epsfysize=\epsfxsize \multiply\epsfysize\epsftmp
\multiply\epsftmp\epsftsize \advance\epsfrsize-\epsftmp
\epsftmp=\epsfxsize
\loop \advance\epsfrsize\epsfrsize \divide\epsftmp 2
\ifnum\epsftmp>0
\ifnum\epsfrsize<\epsftsize\else
\advance\epsfrsize-\epsftsize \advance\epsfysize\epsftmp \fi
\repeat
\fi
%
% Finally, we make the vbox and stick in a \special that dvips can parse.
%
\ifepsfverbose\message{#1: width=\the\epsfxsize, height=\the\epsfysize}\fi
\epsftmp=10\epsfxsize \divide\epsftmp\pspoints
\vbox to\epsfysize{\vfil\hbox to\epsfxsize{%
\special{psfile=#1 llx=\epsfllx\space lly=\epsflly\space
urx=\epsfurx\space ury=\epsfury\space rwi=\number\epsftmp}%
\hfil}}%
\fi\epsfxsize=0pt}%
%
% We still need to define the tricky \epsfaux macro. This requires
% a couple of magic constants for comparison purposes.
%
{\catcode`\%=12 \global\let\epsfpercent=%\global\def\epsfbblit{%BoundingBox}}%
%
% So we're ready to check for `%BoundingBox:' and to grab the
% values if they are found.
%
\long\def\epsfaux#1#2:#3\\{\ifx#1\epsfpercent
\def\testit{#2}\ifx\testit\epsfbblit
\message{<#3>}
\epsfgrab #3 . . . \\%
\epsffileokfalse
\global\epsfbbfoundtrue
\fi\else\ifx#1\par\else\epsffileokfalse\fi\fi}%
%
% Here we grab the values and stuff them in the appropriate definitions.
%
\def\epsfnaught{}%
\def\epsfgrab #1 #2 #3 #4 #5\\{\message{<#1,#2,#3,#4,#5>}%
\global\def\epsfllx{#1}\ifx\epsfllx\epsfnaught
\epsfgrab #2 #3 #4 #5 .\\\else
\global\def\epsflly{#2}%
\global\def\epsfurx{#3}\global\def\epsfury{#4}\fi}%
%
% Finally, another definition for compatibility with older macros.
%
\let\epsffile=\epsfbox
In general, either {\tt fork}, {\tt vfork}, or both system calls can
\index{{\tt fork} within ISIS applications}
\index{{\tt vfork} within ISIS applications}
be invoked from within an ISIS application,
but some precautions are needed to ensure that the
child doesn't do things that would confuse ISIS.
We are a bit vague about which call will work because some operating systems
don't handle one or both calls correctly in the case of processes with
lightweight tasks running in them.
ISIS defines a procedure {\tt isis\_dofork} that calls whichever
routine seems to work best and returns the result.
Here's a machine by machine tally:
\begin{enumerate}
\item[SUN3]
Under OS 4.0.1, we find that vfork has problems with ISIS processes
even if we compile with -llwp. Perhaps the vfork call, which
duplicates ``only the active stack frame'', is unable to make sense
of the dynamically allocated stack frame used by the lightweight
process package. Fork, however, seems to work correctly here.
\item[SUN4]
Same comments, but you also need a ``pragma'' file (vfork.h) to warn
the compiler that the vfork routine modifies registers in a non-standard way.
Unfortunately, the SUN4 is the only machine that has this include file.
\item[HPUX]
Last time we checked, fork worked but vfork didn't. We currently
use fork and this seems to work reasonably well.
\item[APOLLO]
Under the APOLLO UNIX, fork and vfork are both broken.
We ended up hacking around this using a routine called {\tt isis\_fork\_and\_execve},
source for which can be found in {\tt clib/cl\_isis.c}.
\item[VAX]
On the VAX both calls seem to work despite the ISIS implementation
of lightweight tasks.
We use vfork.
\item[MACH]
On MACH systems both calls seem to work under cthreads.
We use vfork.
\end{enumerate}
It is important that the child process, immediately after the fork
or vfork, call {\tt isis\_disconnect}.
This severs the connection with ISIS.
It is then safe to do an {\tt exec}.
\index{{\tt exec} within ISIS applications}
If the child fails to sever the connection its parent was
using to communicate with ISIS, ISIS may be {\em unable to
detect when the parent process terminates.}
This is because ISIS detects client failures by observing that the
communication channel to a client has closed.
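A minimal sketch of the recommended sequence (the program being exec'ed here
is arbitrary):
\begin{verbatim}
/* Spawn a child that runs some other program, without confusing ISIS */
run_child()
{
        int pid;

        pid = isis_dofork();           /* fork or vfork, whichever works here */
        if(pid == 0)
        {
                isis_disconnect();     /* sever the child's connection to ISIS */
                execl("/bin/ls", "ls", "-l", (char *)0);
                _exit(1);              /* reached only if the exec fails */
        }
}
\end{verbatim}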
To detect child termination, use the
function {\tt isis\_chwait(routine, arg)}.
{\tt isis\_chwait} will fork a task executing {\tt routine(arg)}
in the event that a child process dies.
(If multiple terminations occur,
it will fork the task once per termination).
This task would normally do a {\tt wait} system
call to pick up child termination status.
The function {\tt isis\_chwait\_sig(cond)} is also available;
it requests that ISIS signal {\tt cond} when a child dies.
ISIS does not currently permit child processes
to connect to ISIS without having done an exec first.
Currently, if a process forks a child
that calls {\tt isis\_init} immediately, the call will fail.
{\em Note: } Although a child process should probably belong to all the process
groups to which its parent belonged, ISIS does not currently
support this, and it may turn out to be a very hard feature to implement.
For the moment, the child is a new process, inheriting none of the ISIS-maintained
properties of its parent.
It should join any process groups to which you want it to belong, re-register any
watch operations that are desired, etc.
\subsection*{Format strings}\label{fmt}
\index{format strings}
Format strings are used to denote the number and the type of the arguments
following the format string, in a manner similar to the functions
{\tt fprintf} and {\tt fscanf}.
A format string is a null terminated string containing zero or more of
the format items described below.
You may include other printable characters in a format
string to improve readability; these are ignored in constructing or reading the
message and serve the same function as a comment.
The basic format items and the types they represent are given below.
A number of examples are given at the end of this section, and
a more complete list appears later in this documentation.
\index{format items}
\begin{description}
\item[{\tt \%c}] A variable of type {\tt char}.
\item[{\tt \%s}] A pointer to a null-terminated string of characters.
\item[{\tt \%d}] A 4-byte integer variable.
\item[{\tt \%l}] Same as {\tt \%d}.
\item[{\tt \%h}] A 2-byte integer variable.
\item[{\tt \%a}] A variable of type {\tt address}.
\item[{\tt \%e}] A variable of type {\tt event\_id}.
\item[{\tt \%b}] A variable of type {\tt bitvec}.
\item[{\tt \%p}] A process-group ``view''.
\item[{\tt \%m}] A message pointer.
\item[{\tt \%f}] A single-precision floating point variable.
\item[{\tt \%g}] A double-precision floating point variable.
\end{description}
In {\tt msg\_put}, a format item in the format string means that the string
will be followed by an argument of the corresponding type, the order of the
arguments being the same as the order of format items in the string.
{\tt msg\_put} copies the data from these arguments into the message.
In the case of a message pointer ({\tt \%m}), the message pointed to is
copied as a whole into the other message.
There are some restrictions: ``by value'' argument passing
turns out to be non-portable in the case of structures, hence
formats corresponding to structures (a, e, b, p) as well as
user-defined format types (see below) can only be output using
the array notation (also described below).
In {\tt msg\_get}, each format item in the format string means that the
string will be followed by {\em a pointer to} a variable
of the corresponding type.
The data from the message is copied into the variable.
In the case of a message pointer ({\tt \%m}), the corresponding variable is
set to point to a message copied out of the original message.
The message copied out must be explicitly deleted by calling {\tt
msg\_delete}.
All formats are supported in {\tt msg\_get} because
by-value argument passing is not used in this case.
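As a simple illustration, the following sketch puts an integer into a
message and reads it back out (the variable names are hypothetical):
\begin{verbatim}
    message *mp;
    int count = 10, n;

    mp = msg_newmsg();
    msg_put(mp, "%d", count);     /* copy the integer into the message */
    ...
    msg_get(mp, "%d", &n);        /* copy the integer back out         */
    msg_delete(mp);
\end{verbatim}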
Most of the format items above have an array version, obtained by replacing
\index{array format items}
the lower case letter in the item by the corresponding upper case letter
(e.g. {\tt \%D} for an array of 4-byte integers).
The two exceptions are {\tt \%s} and {\tt \%m}.
{\tt msg\_put} and {\tt msg\_get} expect {\em two} arguments for
each array item.
In {\tt msg\_put} the first argument is an
array of the corresponding type (or equivalently a pointer to the
first element of such an array) and the second is the number of elements
to copy into the message.
In {\tt msg\_get} the first argument is a pointer to an area large
enough to store the array and
the second is a pointer to an integer.
{\tt msg\_get} sets the value of this integer to the number of array elements
read out of the message.
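For example, an array of integers might be transferred as sketched below,
continuing with the message pointer {\tt mp} from the previous sketch:
\begin{verbatim}
    int temps[3], vals[8], nvals;

    temps[0] = 10; temps[1] = 20; temps[2] = 30;
    msg_put(mp, "%D", temps, 3);         /* array plus element count        */
    ...
    msg_get(mp, "%D", vals, &nvals);     /* area plus pointer to the count  */
\end{verbatim}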
When the data to be put in a message gets fairly large (more than about
256 bytes) it may be preferable to avoid copying the information into
the message. Such copying can be eliminated using the format
{\tt \%*X} where 'X' is one of the upper-case format items.
When this mechanism is used, ISIS stores an {\em indirect data reference}
into the message: a pointer to the user-supplied data area.
Needless to say, while the data is referenced this way the user should
take pains not to change it.
To assist in implementing this policy, a
third item is required when this ``data indirection'' scheme is
used, namely the address of a procedure that ISIS should call back
when the pointer to the user's data buffer is no longer needed (i.e.
the message has been fully deallocated).
This routine will be called with the address of the data object
as its only argument.
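A rough sketch of this usage, assuming a message pointer {\tt mp},
a character array {\tt buffer} holding {\tt buflen} bytes, and a
hypothetical callback {\tt data\_done}:
\begin{verbatim}
    void data_done(ptr)
      char *ptr;
      {
        free(ptr);       /* The message no longer references the buffer */
      }

    ...
    msg_put(mp, "%*C", buffer, buflen, data_done);
\end{verbatim}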
In some cases, such as a call to {\tt msg\_put} to output a single
object of type {\tt address},
the by-value restriction forces the user to treat the
object as an array of length one.
We found it awkward to include the constant 1 as an argument in
this case, hence a format type {\tt \%X[1]} (or in general,
{\tt \%X[constant]}) is supported. In this case, the
length is taken from the format item and only one argument, namely
the pointer to the object to include, is expected.
(When combining the data indirection notation with the constant-length
array notation, the free routine is still needed and hence there
will be two arguments).
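For instance, a single address might be placed in a message as follows
(a minimal sketch, again assuming a message pointer {\tt mp}):
\begin{verbatim}
    address who;
    ...
    msg_put(mp, "%A[1]", &who);    /* length is taken from the format item */
\end{verbatim}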
Using {\tt msg\_get} as above to read an array out of a message requires that
the maximum size of the array be known in advance.
A {\tt malloc} version is provided for each type of format item to cover
the case where this size is not known.
This version of format item has the symbol ``{\tt +}'' after the ``{\tt \%}''
(e.g. {\tt \%+D} for an array of 4-byte integers).
\index{automatic {\tt malloc} using {\tt \%+X} format items}
When used in {\tt msg\_get}, ISIS will allocate enough space for the array
using the {\tt malloc} system call.
You must explicitly release this space by calling {\tt free}.
(In the case of a 0-length object, ISIS will generate a null pointer).
Again, {\tt msg\_get} expects two arguments for each {\tt \%+} format item.
The first one is a pointer to a {\em pointer variable} of the given type and the second is
a pointer to an integer variable.
The pointer variable is set to point to the start of the {\tt malloc}'ed
space, while the integer is set to the number of array elements read out of
the message.
If you don't wish to know the length, the second argument can be
given as a null pointer.
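A minimal sketch of the {\tt malloc} version, assuming a message pointer {\tt mp}:
\begin{verbatim}
    int *values, nvalues;

    msg_get(mp, "%+D", &values, &nvalues);   /* ISIS allocates the space */
    ...
    free(values);                            /* caller must release it   */
\end{verbatim}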
Alternatively, if you know that the length will be some constant,
you can use the {\tt \%X[..]} notation.
However, be aware that in this case, ISIS will return {\tt IE\_MISMATCH}
if the length specified and the length of the object in the message
don't correspond.
If you wish to avoid copying arrays on reception,
you may use the ``in place'' versions
of the array format items.
This version of format item has the symbol ``{\tt -}'' after the ``{\tt \%}''
(e.g. {\tt \%-D} for an array of 4-byte integers).
\index{pointer into message using {\tt \%-X} format items}
The arguments expected by {\tt msg\_get} are the same as for the {\tt
malloc} version.
The difference is that instead of returning a pointer to {\tt malloc}'ed
space into which the array is copied, {\tt msg\_get} returns a pointer to
the array {\em in the message itself}.
Use this feature with care, as unpredictable things can happen if you use
the pointer to write new data into the array, or if you try to access the
array after the message has been deleted.
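A minimal sketch of the in-place version, assuming a message pointer {\tt mp}:
\begin{verbatim}
    int *inplace, n;

    msg_get(mp, "%-D", &inplace, &n);   /* pointer into the message itself */
    /* ... use the data read-only, before the message is deleted ...       */
\end{verbatim}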
Notice that the in-place format item used when sending a message
is ``{\tt \%*}'' whereas the one used when receiving a message is
``{\tt \%-}'' or ``{\tt \%+}''.
Some users have complained that this prevents them from using
identical format items when sending and receiving the same message,
making code harder to maintain.
We considered using the same format item for both of these
cases, but because the detailed meanings are quite different and
the arguments required are so different, we felt that it might
simply confuse less sophisticated ISIS users if the format items
were the same.
Most ISIS functions can be invoked directly from UNIX FORTRAN\index{FORTRAN interface to ISIS},
although
this feature has not been tested widely on machines other than SUN 3 and
SUN 4 computers. The calling sequences differ from the usual
ISIS calls primarily in name, because some versions of
FORTRAN do not permit the underscore
character in function names.
Some other minor differences are discussed below.
The general rule we followed is to mimic the ISIS interface as closely as possible.
In particular, fortran calls to ISIS use the same name as from C.
For compatibility with versions of fortran that disallow the underscore
character in variable names, we have two copies of the interface, one
with underscores, and the second with
underscores simply deleted
(e.g. {\tt pgjoin} in place of {\tt pg\_join}).
Things like process group names are passed as null-terminated
strings, which is how fortran compiles string constants enclosed in
double-quote marks on the SUN.
The only major difference between fortran programs that use ISIS and
C programs that use ISIS is that fortran programs cannot allocate or manipulate
addresses directly.
Instead, fortran code always works with pointers to addresses, declaring them
to be of type INTEGER*4.
Routines for getting at the information in these addresses and for allocating
new ones dynamically are provided.
It is important to understand that fortran programs are NOT permitted
to treat arrays of integers as ``addresses''.
There is no direct way to manipulate addresses from fortran.
Thus, where a function returns a non-integer value (as does {\tt pg\_join}, which returns
a pointer to an address) we
simply ``cast'' it into an INTEGER*4 value.
If you really need to manipulate addresses explicitly, we suggest
that you code routines in C for this purpose and link them into your
fortran program.
Some warnings and comments:
\begin{enumerate}
\item When an ISIS routine is called with a routine as an argument,
the calling FORTRAN subroutine or function must {\em always}
declare that routine to
be EXTERNAL, even if the subroutine appears in the same file as
the routine that referenced its name.
The problem is that names have purely local scope in fortran.
Lacking this declaration, fortran will allocate a variable of the same name and
pass its address, which causes a lot of confusion when ISIS tries to call the
procedure back! In general, ISIS will try and detect this
mistake and print an error message if you
omit an EXTERNAL declaration.
\item As for addresses, fortran routines will
need to work with pointers to event-id structures, site-views, groupviews, messages,
etc. by declaring these as INTEGER*4.
\item From fortran, a condition variable (for use in twait() or tsig()) is
represented as an INTEGER*4 variable, initialized to 0.
All the task mechanisms can be called from FORTRAN.
\item
A single exception has been made to the rule that fortran routines can't
allocate structures and arrays of structures. This exception is that
fortran routines that manipulate or pass site-id's can use arrays of INTEGER *2
variables for this purpose.
A null-terminated list of site-id's, as for use in {\tt isisrexec} ({\tt isis\_rexec} when
called from C), would be represented as an array of site-id's terminated by a 0 entry.
Fortran routines are not permitted to call the broadcast routines with arrays of
addresses, however.
\item Many ISIS functions return integer results, but have names
starting with non-standard characters, such as {\tt addrismine}.
A FORTRAN caller should be careful to declare such functions correctly.
\item
The format strings used for passing/getting addresses are different
from what you might expect. See below for details.
\end{enumerate}
We recommend that FORTRAN users employ the C-preprocessor and include the
file ``fisis.h'', which defines the constants that one typically
needs access to from an ISIS application program.
In standard UNIX,
if a FORTRAN source file has a name ending in .F, the preprocessor is run
automatically to convert it to a file with a name ending in .f (lower case), which
then runs through the F77 compiler.
An example is included in the ISIS demos area: ftest.F. \index{ftest.F (fortran demo program)}
In this file, the ISIS related constants are given the same name
as the corresponding C constant, but with underscores removed from the name,
e.g. {\tt PGXFER} in place of {\tt PG\_XFER}.
Note that if your fortran supports underscores in function names,
they can be included in calls to ISIS.
What follows is a list of ISIS functions accessible from FORTRAN with calling
sequences that {\em differ} from the standard ones for ISIS.
The ISIS calls not listed below
are supported with the identical calling sequence
as for normal ISIS calls (except for changes in the routine name and
argument types outlined above).
\begin{verbatim}
FORTRAN CALL EQUIVALENT ISIS CALL
addr = address(...) addr = &ADDRESS(...) (returns pointer instead of value)
xoutcomes(xlist, ....) copies x_outcomes(partname) into xlist.
isissleep(secs) task sleeps for seconds
isissleepms(ms) task sleeps for milliseconds
ROUTINES FOR ACCESSING FIELDS OF AN ADDRESS:
addrprocess(a) a.process
addrentry(a) a.entry
addrsite(a) a.site
addrincarn(a) a.incarn
ROUTINES FOR OBTAINING VALUES OF STANDARD CONSTANTS
mysno() my_site_no
mysincarn() my_site_incarn
mysid() my_site_id
mypid() my_process_id
myaddress() my_address
nulladdress() NULLADDRESS
myhost(where) copies my_host into character array where
sitenames(where, i) copies name of site number i into where
isisdir(where) copies isis_dir into where
isiserrno() isis_errno
isisnsent() isis_nsent
isisnreplies() isis_nreplies
isissocket() isis_socket
isisstate() isis_state
ROUTINES FOR ACCESSING FIELDS OF A GROUPVIEW STRUCTURE
gvviewid(gv) gv->gv_viewid
gvincarn(gv) gv->gv_incarn
gvflag(gv) gv->gv_flag
gvgaddr(gv) gv->gv_gaddr
gvname(where, gv) copies gv->gv_name into where
gvnmemb(gv) gv->gv_nmemb
gvnclient(gv) gv->gv_nclient
gvmember(gv, n) gv->gv_members(n)
gvclient(gv, n) gv->gv_clients(n)
gvjoined(gv) gv->gv_joined
gvdeparted(addr, gv) gv->gv_departed
pmembers(gv) calls paddrs(gv->gv_members)
pclients_(gv) calls paddrs(gv->gv_clients)
ROUTINES FOR ACCESSING FIELDS OF A SITE-VIEW:
svviewid(sv) sv->sv_viewid
svslist(sv, i) sv->sv_slist[i]
svincarn(sv, i) (int)sv->sv_incarn[i]
bv = svfailed(sv) bv = (int)&sv->sv_failed
bv = svrecovered(sv) bv = (int)&sv->sv_recovered
ROUTINES FOR ACCESSING FIELDS OF A BIT VECTOR (array of 4 LOGICAL*4):
bit(bv, i) bit(bv, i)
bis(bv, i) bis(bv, i)
bclr(bv, i) bclr(bv, i)
btst(bv, i) btst(bv, i)
bzero(bv) bzero(bv)
\end{verbatim}
The format strings used to put an address into a message from
FORTRAN will typically be of the ``\%A[1]'' variety.
The format strings used to extract an address from a message would
typically specify ``\%-a''; in a call to {\tt msgget}
the corresponding variable
would be declared to be of type INTEGER*4.
Notice that although one {\em could} use a ``\%A[1]'' format in
{\tt msg\_get}, this would copy the address into an array in the FORTRAN
data space. Since ISIS expects FORTRAN to pass an integer variable
containing a pointer to an address, an address maintained
by value in this manner is useless to the
FORTRAN programmer. (Were such a variable passed into a routine like
bcast, ISIS would receive an argument of type {\tt address*} when it
expected an argument of type {\tt address**}, and a core dump would
probably result).
\documentstyle{article}
\begin{document}
{\bf SOFTWARE PACKAGE AVAILABLE} \\ \\
{\bf Industry:} {\em Cross Industry: HP9000 network users} \\ \\
{\em Abstract}: ISIS is a toolkit for constructing distributed
programs under HPUX and other versions of UNIX. ISIS provides a wide
range of mechanisms for problems that include management of
replicated in-core or disk data structures, distributing a computational
task over a set of programs, synchronization of access to shared resources,
fault-tolerance, automated restart after failures, integration of a
recovered component into an online system, and dynamic reconfiguration
to adapt to changing loads, failures, or other application-specific events. \\ \\
ISIS is unusual because it supports a {\em virtually synchronous} style of
execution and includes {\em process grouping} and {\em group communication}
mechanisms. These help in structuring distributed
computations and simplify the implementation of
distributed algorithms.
As a result, ISIS may be the only existing system in which
programmers can rapidly build fault-tolerant distributed software with
no special training. \\ \\
ISIS is also highly inter-operable, and will support networks containing
a wide variety of vendor hardware and versions of UNIX, including the HP9000
series 300 and 600 machines running HPUX,
DEC's VAX systems (under both ULTRIX and the Mt. Xinu versions of UNIX),
SUN 3 and SUN 4 systems, Apollo, and Gould. A port to the MACH system on the
NEXT machine is now underway, and a DEC VMS port is planned for late in 1989.
Byte swapping and other data format conversions are performed automatically,
and there is a minimum of distributed ``connection establishment'' mechanism
to deal with.
Although ISIS is currently used only within C programs, during
the first quarter of 1989 interfaces to FORTRAN 77 and Common Lisp
will be released. A single application can combine software written
in any of these languages and running on any combination of these machines.\\ \\
Typical ISIS applications include a manufacturing company that is using ISIS
on the factory floor, as part of a VLSI fabrication system, a
signal processing firm that is using ISIS in part of a world-wide
system for collection and analysis of seismic signals, and a stock brokerage
firm that is using ISIS as part of its on-line trading systems. More than
170 copies of ISIS have now been distributed world-wide. \\ \\
ISIS was developed and is presently supported by the
ISIS Project in the Department of Computer Science at Cornell University.
The system itself is publicly available at no fee. \\
{\em Address:} \parbox{3in}{
ISIS Project (Attn: M. Schimizzi) \\
Department of Computer Science \\
Upson Hall \\
Cornell University \\
Ithaca, New York 14853} \\ \\
{\em Phone:} 607-255-9198 \\ \\
{\em E-mail:} schiz@cs.cornell.edu \\ \\
{\em Contact person:} Margaret Schimizzi\\ \\
{\em Support:} Although the ISIS Research group is supporting ISIS
to the extent that it can, it is also possible to obtain support
contracts and additional
consulting services on a fee-basis from a company affiliated with the group. \\ \\
{\em Address:} \parbox{3in}{
Dr. Kenneth P. Birman, President \\
ISIS Distributed Systems, Inc.\\
501 Salem Drive\\
Ithaca, New York 14850 } \\ \\
{\em Phone:} 607-255-9199 \\ \\
{\em E-mail:} birman@cs.cornell.edu
\end{document}
\begin{theindex}
\item {{\tt /etc/rc.local} configuration file}, 335
\item {{\bf /etc/services} file}, 333
\item {{\tt /etc/services} file}, 7, 136
\item {/usr/bin/startisis shell script}, 135, 336
\item {{\tt /usr/bin/startisis} shell script (for auto-restart)}, 358
\indexspace
\item {{\tt A} option to {\tt isis}}, 335
\item {\tt abcast}, 287, 350
\item {\tt abortreply}, 102, 173, 350
\item {\tt addr\_cmp}, 346
\item {\tt addr\_isequal}, 346
\item {\tt addr\_ismine}, 346
\item {\tt addr\_isnull}, 346
\item {addresses}, 2, 86
\subitem {\tt addr\_cmp}, 87
\subitem {\tt addr\_isequal}, 87
\subitem {\tt addr\_ismine}, 87
\subitem {\tt addr\_isnull}, 87
\subitem {\tt my\_address}, 86
\subitem {\tt NULLADDRESS}, 86
\subitem {permanence}, 86
\subitem {printing (\tt paddr)}, 87
\subitem {structure defined}, 88, 345
\item {\tt ALL}, 99, 171, 350
\item {\tt allow\_xfers\_xd}, 120, 121, 352
\item {array format items}, 93, 95, 140
\item {\em atomicity}, 178
\item {authentication during join}, 107
\item {auto-start of ISIS system}, 135
\item {automatic {\tt malloc} using {\tt \%+X} format items}, 94, 143
\indexspace
\item {bank demo}, 365
\item {\tt bc\_cancel}, 175
\item {\tt bc\_getevent}, 174, 350
\item {\tt bc\_poll}, 175
\item {\tt bc\_wait}, 174
\item {\tt bcast}, 13, 350
\item {{\em bcast} port in {\tt /etc/services}}, 136
\item {\tt bcast}
\subitem {address specified as a list}, 171
\subitem {examples of reply formats}, 169
\subitem {fork option}, 171, 174
\subitem {how replies are stored in memory}, 168
\subitem {long form}, 171
\subitem {number of messages sent / replies received}, 168
\subitem {number of replies desired}, 171
\subitem {reply format mismatch}, 170
\subitem {reply includes a vector}, 168
\subitem {short form}, 99, 167
\subitem {when will it block}, 170
\item {\tt bcast\_l}, 171, 350
\item {\tt bccancel}, 350
\item {\tt bcid}, 174
\item {\tt bclr}, 354
\item {\tt bcpoll}, 350
\item {\tt bcwait}, 350
\item {\tt bic}, 354
\item {\tt bicv}, 354
\item {{\tt bin/spooler}}, 270
\item {\tt bis}, 354
\item {\tt bisv}, 354
\item {\tt bit}, 354
\item {\tt bitvec}, 353
\item {broadcasts}, 350
\subitem {asynchronous}, 287
\subitem {event identifier (bcid)}, 174
\subitem {introduction}, 98
\subitem {usage of}, 2
\item {\tt bset}, 354
\item {bulletin board facility}, 306
\item {bypass communication}, 176, 235
\indexspace
\item {\em causality}, 178, 285
\item {\tt cbcast}, 178, 284, 350
\subitem {bypass version}, 176, 235
\item {{\tt cc\_terminate\_l}}, 196
\item {{\tt cc\_terminate}}, 196
\item {cc\_terminate}
\subitem {cc\_terminate\_l}, 356
\item {client dumps}
\subitem {generating with \tt cl\_dump()}, 315
\subitem {generating with {\tt kill} command}, 317
\subitem {interpreting}, 315
\item {cluster of sites}, 84
\item {{\tt cmd} tool}, 311, 360
\subitem {how to run}, 311
\subitem {sending interactive messages with}, 312
\item {\tt condition}, 356
\item {configuration files used by ISIS}, 333
\item {configuration of a process group}, 31
\item {congestion (overload)}, 328
\item {connection failures}, 341
\item {connection of remote clients to ISIS system}, 136
\item {converting old programs to run under ISIS}, 133
\item {coordinator-cohort computation}, 194
\subitem {\tt coord\_cohort}, 356
\subitem {coordinator selection rule}, 197
\subitem {deciding when to use}, 200
\subitem {in a transactional setting}, 232
\subitem {need to inhibit joins during}, 197
\subitem {use of replicated data in}, 207
\item {crashes}, 335
\indexspace
\item {directory where ISIS is started up}, 337
\item {distributed computation}, 201
\subitem {aborting while in progress}, 208
\subitem {coordinator-cohort}, 194
\subitem {redundant}, 192
\subitem {use of replicated data in}, 205
\item {{\sc dump} command in {\tt cmd} tool}, 312
\indexspace
\item {{\em entry} number}
\subitem {declaring}, 98
\subitem {defined}, 10, 88
\subitem {entry undefined}, 99
\item {error numbers}, 83
\item {\em event ordering}, 22
\item {{\tt event\_id} structure}, 175
\item {{\tt exec} within ISIS applications}, 300
\item {{\tt exit} system call in ISIS clients}, 9
\item {external ordering of broadcasts}, 178
\indexspace
\item {failures}, 178
\item {\tt fbcast}, 284
\item {files}, 359
\subitem {isis.h}, 359
\subitem {libisis1.a}, 359
\subitem {libisis2.a}, 359
\subitem {libisism.a}, 359
\subitem {rexec.log}, 360
\subitem {rmgr.rc}, 360
\subitem {sites}, 333, 360
\subitem {used within ISIS applications}, 301
\item {\tt flush}, 350
\item {flush}
\subitem {used in transactions}, 227
\item {\tt flush}
\subitem {with files or devices}, 301
\item {{\tt fork} within ISIS applications}, 299
\item {format items}, 92, 141
\subitem {defining new ones}, 147
\item {format strings}, 92, 140, 348
\item {FORTRAN interface to ISIS}, 381
\item {\tt forward}, 2, 102, 173, 350
\item {ftest.F (fortran demo program)}, 382
\indexspace
\item {\tt gbcast}, 288, 350
\item {\em group addressing}, 26
\item {{\sc group} command in {\tt cmd} tool}, 315
\item {{\tt groupview} data structure}, 9, 112, 353
\indexspace
\item {hanging}
\subitem {failure to call {\tt isis\_start\_done}}, 132
\item {hierarchical process groups}, 293
\indexspace
\item {\em I am dead. Site xxx/xxx told me so.}, 85
\item {\tt IE\_MISMATCH}, 170
\item {\tt IE\_RESTRICTED}, 172
\item {\tt IE\_TOTFAIL}, 123
\item {\tt IE\_XXXX}, 345
\item {{\em incarn} number}, 88
\item {{\sc index} command in {\tt cmd} tool}, 312
\item {\tt ISAGID}, 346
\item {\tt ISAPID}, 346
\item {\tt isis}, 360
\item {{\tt isis} command}, 333
\item {ISIS monitoring utility ({\tt prstat})}, 326
\item {ISIS not operational at this site}, 341
\item {ISIS startup directory}, 337
\item {ISIS system shutdowns}, 335
\item {\tt isis}
\subitem {{\tt -Hn} option}, 301
\item {\tt isis.h}, 130, 359
\item {{\tt isis.rc} configuration file}, 333
\item {\tt isis\_accept\_events}, 298
\subitem {used for polling}, 209
\item {\tt ISIS\_ASYNC}, 209, 298, 389
\item {\tt ISIS\_AUTOSTART}, 358
\item {{\tt ISIS\_AUTOSTART} option to {\tt isis\_init\_l}}, 135, 336
\item {\tt ISIS\_BLOCK}, 209, 298, 389
\item {\tt isis\_chwait}, 357
\item {\tt isis\_chwait\_sig}, 357
\item {\tt isis\_define\_type}, 147, 350
\item {\tt isis\_entry}, 10, 98, 130, 357, 386
\item {\tt isis\_entry\_stacksize}, 163, 357
\item {\tt isis\_errno}, 83, 345
\item {\tt isis\_errno.h}, 83
\item {\tt isis\_except}, 160, 357
\item {\tt isis\_except\_sig}, 161, 357
\item {\tt isis\_failed}, 357
\item {\tt isis\_inhibit\_joins}, 108, 198, 352
\item {\tt isis\_init}, 130, 357
\item {\tt isis\_init\_l}, 130, 357
\item {{\tt isis\_init\_l}}, 7, 135
\item {\tt isis\_input}, 130, 160, 302, 357
\item {\tt isis\_input\_sig}, 130, 161, 302, 357
\item {\tt isis\_logentry}, 130, 248
\item {\tt isis\_logged}, 357
\item {\tt isis\_mainloop}, 9, 130, 357
\item {\tt isis\_nreplies}, 13, 352
\item {\tt isis\_nsent}, 13, 352
\item {\tt isis\_output}, 160, 357
\item {\tt isis\_output\_sig}, 161, 357
\item {\tt ISIS\_PANIC}, 358
\item {{\tt ISIS\_PANIC} option to {\tt isis\_init\_l}}, 135
\item {\tt isis\_perror}, 83, 345
\item {\tt isis\_probe}, 7, 60, 357
\item {\tt isis\_remote}, 357
\item {{\tt isis\_remote}}, 7, 136
\item {\tt isis\_rexec}, 296
\item {\tt isis\_select}, 302, 357
\item {\tt isis\_signal}, 130, 160, 298, 302, 357
\item {\tt isis\_signal\_sig}, 161, 302, 357
\item {\tt isis\_socket}, 358
\item {\tt isis\_start\_done}, 9, 130, 357
\subitem {failure to call}, 132
\item {\tt isis\_task}, 130, 158, 357
\item {\tt ISIS\_TIMEOUT}, 209, 298, 389
\item {\tt isis\_timeout}, 165
\item {\tt isis\_timeout\_cancel}, 165
\item {\tt isis\_timeout\_reschedule}, 165
\item {{\tt isis\_transport} }, 176, 238
\item {\tt isis\_wait}, 161, 302, 357
\item {\tt isis\_wait\_cancel}, 161
\item {{\tt ISISPORT}}, 7
\item {{\tt ISISREMOTE}}, 136
\indexspace
\item {libisis1.a}, 16, 359
\item {libisis2.a}, 16, 359
\item {libisism.a}, 16, 359
\item {linking a program to ISIS libraries}, 16
\item {{\sc list} command in {\tt cmd} tool}, 311
\item {\tt lmgr}, 360
\item {load control in ISIS}, 328
\item {load monitoring program based on news service}, 308
\item {locking}, 219
\subitem {read-lock broken after failure}, 220
\subitem {unbreakable read-locks}, 222
\item {\tt log\_checkpoint}, 255
\item {\tt log\_flush}, 256
\subitem {and automatic flushing}, 259
\item {\tt log\_ignoreold}, 249
\item {\tt log\_in\_replay}, 254
\item {\tt log\_recovered}, 254
\item {\tt log\_write}, 263
\item {\em logging}
\subitem {activated from {\tt pg\_join}}, 32, 245
\subitem {automatic mode}, 247
\subitem {cleanup actions}, 254
\subitem {consistency between logged groups}, 270
\subitem {controlling recoverable states}, 256
\subitem {effects of environment}, 259
\subitem {in a transactional setting}, 233
\subitem {log file names}, 248, 268
\subitem {manual mode}, 262
\subitem {manual mode recovery}, 263
\subitem {recovery sequence}, 252
\subitem {states recovered}, 255
\subitem {well-behaved process groups}, 259
\item {long-haul communication facility}, 270
\indexspace
\item {main task}, 9, 130, 385
\item {\tt MAJORITY}, 99, 171, 350
\item {\tt MAKE\_SITE\_ID}, 346
\item {{\em matchmaker} service (example)}, 139
\item {\tt MAX\_PROCS}, 171, 294
\item {\tt MAX\_SITES}, 354
\item {\tt MAXBITS}, 354
\item {\tt mbcast}, 283
\item {messages}, 88, 137
\subitem {broadcast to a process group}, 98
\subitem {copy data into using format}, 89, 140
\subitem {copy data out from using format}, 140
\subitem {copy data out using format}, 90
\subitem {create using format}, 89, 140
\subitem {creating initially empty}, 89, 138
\subitem {defining new format types}, 147
\subitem {deleting}, 90, 91, 138, 139
\subitem {determine destinations}, 151
\subitem {determine if forwarded}, 102, 150
\subitem {determine length}, 151
\subitem {determine most recent sender}, 102, 150
\subitem {determine original sender}, 102, 150
\subitem {discarded silently due to undefined entry number}, 99
\subitem {fields}, 147
\subitem {forward to different destination}, 102
\subitem {identification number}, 151
\subitem {increment reference count}, 91, 138
\subitem {{\tt message} data type}, 347
\subitem {prepare to rescan}, 90, 147
\subitem {printing contents}, 91, 92
\subitem {read from file}, 152
\subitem {read from stream}, 152
\subitem {received by an entry}, 99
\subitem {scanning position}, 147
\subitem {write to file descriptor}, 152
\subitem {write to stream}, 152
\item {\em monitoring}, 6, 353
\item {\tt msg\_convertaddress}, 148, 350
\item {\tt msg\_convertchar}, 148, 350
\item {\tt msg\_convertlong}, 148, 350
\item {\tt msg\_convertpgroup}, 148, 350
\item {\tt msg\_convertshort}, 148, 350
\item {\tt msg\_convertsiteid}, 148, 350
\item {\tt msg\_copy}, 138, 347
\item {\tt msg\_delete}, 90, 95, 138, 347
\item {\tt MSG\_ENQUEUE}, 98, 386
\item {\tt msg\_fread}, 152, 347
\item {\tt msg\_fwrite}, 152, 347
\item {\tt msg\_gen}, 89, 140, 347
\item {\tt msg\_get}, 90, 140
\item {\tt msg\_getdests}, 151
\item {\tt msg\_getfld}, 147, 347
\item {\tt msg\_getid}, 151
\item {\tt msg\_getlen}, 151
\item {\tt msg\_getreplyto}, 347
\item {\tt msg\_getsender}, 102, 150, 347
\item {\tt msg\_gettruesender}, 102, 150, 347
\item {\tt msg\_gettype}, 147, 347
\item {\tt msg\_increfcount}, 91, 138, 347
\item {\tt msg\_isforwarded}, 102, 150, 347
\item {\tt msg\_newmsg}, 89, 95, 138, 347
\item {\tt msg\_printaccess}, 92, 147, 347
\item {\tt msg\_put}, 89, 95, 140, 347
\item {\tt msg\_putfld}, 147, 347
\item {\tt msg\_rcv}, 98, 386
\item {\tt msg\_read}, 152, 347
\item {\tt msg\_ready}, 98, 386
\item {\tt msg\_rewind}, 90, 147, 347
\item {\tt msg\_write}, 152, 347
\item {multicast transport layer}, 176, 238
\item {\tt my\_address}, 86, 347
\item {\tt my\_eid}, 350
\item {\tt my\_eventid}, 352
\item {\tt my\_host}, 84, 347
\item {\tt my\_process\_id}, 347
\item {\tt my\_site\_id}, 84, 347
\item {\tt my\_site\_incarn}, 84, 347
\item {\tt my\_site\_no}, 84, 347
\indexspace
\item {news facility}, 306, 360
\item {\tt news\_clear}, 307
\item {\tt news\_post}, 306
\item {\tt news\_posta}, 307
\item {\tt news\_subscribe}, 306
\item {Nonstandard task packages}, 387
\item {\tt NULLADDRESS}, 86, 347
\item {\tt NULLADDRESS arg. to {\tt pg\_watch}}, 118
\item {\tt nullreply}, 101, 173, 350
\item {\tt NULLSID arg. to {\tt sv\_watch}}, 117
\indexspace
\item {\tt ORIGINAL}, 355
\item {overload and congestion}, 328
\indexspace
\item {\tt paddr}, 87, 315
\item {\tt paddr\_isequal}, 346
\item {parallel make demo}, 367
\item {\tt PG\_ALEN}, 354
\item {\tt PG\_BIGXFER}, 356
\item {\tt pg\_client}, 69, 70, 111, 235, 352
\item {\tt PG\_CLIENT\_AUTHEN}, 352
\item {\tt PG\_CREDENTIALS}, 352
\item {\tt pg\_delete}, 110, 352
\item {\tt pg\_detect\_failure}, 116
\item {\tt PG\_DONTCREATE}, 352
\item {\tt pg\_getview}, 112, 170, 355
\item {{\tt pg\_getview}}, 237
\item {\tt PG\_GLEN}, 354
\item {\tt pg\_inhibit\_joins}, 352
\item {\tt PG\_INIT}, 352
\item {\tt pg\_join}, 9, 105, 120, 130, 352
\subitem {automatic logging}, 247
\subitem {automatic recovery from a log}, 252
\subitem {logging feature}, 245
\subitem {manual logging option}, 262
\subitem {{\tt PG\_CLIENT\_AUTHEN} option}, 107
\subitem {{\tt PG\_CREDENTIALS} option}, 107
\subitem {{\tt PG\_DONTCREATE} option}, 106
\subitem {{\tt PG\_INIT} option}, 106
\subitem {{\tt PG\_JOIN\_AUTHEN} option}, 107
\subitem {{\tt PG\_LOGGED} option}, 106
\subitem {{\tt PG\_MONITOR} option}, 106
\subitem {{\tt PG\_XFER} option}, 107, 120
\subitem {recovery in manual mode}, 263
\subitem {state transfer option}, 32
\subitem {use of {\tt xfer\_refused}}, 107
\item {\tt PG\_JOIN\_AUTHEN}, 352
\item {\tt pg\_join\_inhibit}, 107, 198
\item {\tt pg\_leave}, 110, 352
\item {\tt PG\_LOGGED}, 352
\item {\tt PG\_LOGPARAMS}
\subitem {option to {\tt pg\_join}}, 249
\item {\tt pg\_lookup}, 13, 111, 170, 352
\item {\tt PG\_MONITOR}, 352
\item {\tt pg\_monitor}, 115, 355
\item {\tt pg\_monitor\_cancel}, 116, 355
\item {\tt pg\_rank}, 10, 113, 352
\item {{\tt pg\_rank}}, 237
\item {\tt pg\_rank\_all}, 113
\item {\tt pg\_signal}, 114, 299, 352
\item {\tt pg\_subgroup}, 109, 120, 293, 352
\item {\tt pg\_watch}, 118, 355
\item {\tt pg\_watch\_cancel}, 119, 355
\item {\tt PG\_XFER}, 352
\item {\tt pl\_add}, 352
\item {{\tt pl\_add}}, 237
\item {\tt pl\_create}, 352
\item {{\tt pl\_create}}, 237
\item {\tt pl\_delete}, 352
\item {{\tt pl\_delete}}, 237
\item {{\tt pl\_getview}}, 237
\item {\tt pl\_lookup}, 352
\item {\tt pl\_makegroup}, 352
\item {{\tt pl\_makereal}}, 237
\item {\tt pl\_rank}, 352
\item {{\tt pl\_rank}}, 237
\item {\tt pl\_remove}, 352
\item {{\tt pl\_remove}}, 237
\item {\tt pmsg}, 91, 315
\item {pointer into message using {\tt \%-X} format items}, 94, 143
\item {port number}, 7, 336
\item {{\sc pr\_dump} command in {\tt cmd} tool}, 312
\item {{\tt prepend\_and\_discard}}, 270
\item {\tt print}, 315
\item {print an address}, 87
\item {printing contents of a groupview structure}, 10
\item {\tt proc\_watch}, 119, 355
\item {\tt proc\_watch\_cancel}, 119, 355
\item {process groups}, 2, 103, 352
\subitem {{\em address} of}, 86
\subitem {becoming a client}, 111
\subitem {cancel monitor}, 116
\subitem {cancel watch}, 119
\subitem {creating subgroups}, 109
\subitem {delete}, 110
\subitem {determining address}, 111
\subitem {enabling logging during {\tt pg\_join}}, 106
\subitem {get group view}, 112
\subitem {get rank of member or client}, 113
\subitem {initial view after join}, 132
\subitem {joining and creating}, 105
\subitem {large}, 293
\subitem {leaving}, 110
\subitem {monitor triggered for first time}, 132
\subitem {monitoring for total failure}, 116
\subitem {monitoring membership}, 114, 115
\subitem {rank of member}, 113
\subitem {send UNIX-style signal}, 114
\subitem {sequence of events during {\tt pg\_join}}, 107
\subitem {state transfer}, 119, 122, 132
\subitem {symbolic names}, 3
\subitem {symbolic names for}, 104
\subitem {test for membership}, 113
\subitem {watch member}, 118
\subitem {watching a non-member}, 119
\subitem {watching individual members}, 114
\item {process lists}, 105, 237
\item {process}
\subitem {{\em address} of}, 86
\subitem {cancel watch}, 119
\subitem {watching non group member}, 119
\item {protection}, 107, 111, 113
\item {protocols process dump}, 321
\subitem {creating due to ISIS system crash}, 321
\subitem {creating using {\tt cmd} tool}, 321
\subitem {creating using {\tt kill} command}, 321
\subitem {name of log file}, 321
\item {\tt protos}, 360
\item {{\tt prstat} command for monitoring ISIS}, 326
\indexspace
\item {recovery manager}, 309
\item {redundant computation}, 192
\subitem {in a transactional setting}, 231
\subitem {use of replicated data in}, 206
\item {remote execution}, 295
\item {remote procedure call}, 3
\item {replicated data}, 185
\subitem {configuration data}, 187
\subitem {logging}, 189
\subitem {sender holds mutual exclusion}, 189
\subitem {state transfer issues}, 189
\item {\tt reply}, 350
\item {reply to a message}, 100
\item {\tt reply}
\subitem {long form}, 173
\subitem {short form}, 2, 100, 172
\item {\tt reply\_l}, 350
\item {{\sc rescan} command in {\tt cmd} tool}, 312, 337
\item {{\tt rexec} utility program}, 295, 360
\item {\tt rmgr}, 360
\item {\tt rmgr\_cmd}, 310, 360
\item {\tt rmgr\_register}, 310
\item {\tt rmgr\_unregister}, 310
\item {\tt rmgr\_update}, 309
\indexspace
\item {{\sc group} command}, 311
\item {scope}
\subitem {for namespace searches}, 104
\subitem {specification in {\tt sites} file}, 336
\item {{\sc send} command in {\tt cmd} tool}, 312, 313
\item {setting up ISIS}, 333
\item {{\sc shutdown} command in {\tt cmd} tool}, 312, 343
\item {{\tt SIGOVERFLOW} sent during congestion}, 329
\item {simulating a multi-computer network}, 300
\item {site views}
\subitem {cancel monitor}, 115
\subitem {cancel watch request}, 117
\subitem {monitoring}, 114
\subitem {monitoring continuously}, 114
\subitem {structure defined}, 85
\subitem {watching individual members}, 117
\item {\tt site\_id}, 345
\subitem {macros for manipulating}, 84
\subitem {site numbers}, 84
\subitem {type definition}, 84
\item {\tt SITE\_INCARN}, 346
\item {\tt SITE\_IS\_UP}, 346
\item {\tt SITE\_NO}, 346
\item {{\sc sites} command in {\tt cmd} tool}, 311
\item {\tt sleep}, 155
\item {source directories}, 339
\item {{\tt spool}}, 270
\item {{\tt spool\_advise}}, 270
\item {{\tt spool\_and\_discard}}, 270
\item {{\tt spool\_cancel}}, 270
\item {{\tt spool\_discard}}, 270
\item {{\tt spool\_getseqn}}, 270
\item {{\tt spool\_inquire}}, 270
\item {{\tt spool\_m}}, 270
\item {{\tt spool\_m\_and\_discard}}, 270
\item {{\tt spool\_play\_through}}, 270
\item {{\tt spool\_replay}}, 270
\item {{\tt spool\_set\_ckpt\_pointer}}, 270
\item {{\tt spool\_set\_replay\_pointer}}, 270
\item {{\tt spool\_wait}}, 270
\item {spooler}, 270
\item {start sequence for ISIS clients}, 130
\item {state transfer}, 28, 32, 119, 356
\subitem {enabling during {\tt pg\_join}}, 120
\subitem {enabling in subgroups using {\tt allow\_xfers\_xd}}, 121
\subitem {examples}, 125
\subitem {failure during}, 123, 130
\subitem {flush output buffers}, 122
\subitem {large transfers}, 124
\subitem {locator}, 32, 121, 122, 123, 126
\subitem {mechanism discussed}, 32, 122
\subitem {ordering relative to other events}, 124
\subitem {receive routine}, 32, 122
\subitem {refuse to send state}, 123
\subitem {refuse to transfer state}, 122
\subitem {send piece of state}, 121
\subitem {transfer an array}, 125
\subitem {transfer domains}, 124
\subitem {transfer queue of messages}, 127
\subitem {{\em virtual synchrony} and}, 37
\item {STREAMS}
\subitem {accessed from ISIS}, 302
\item {subdivided computation}
\subitem {basic approaches}, 201
\subitem {example}, 202
\subitem {use of configuration in}, 202
\item {suntools applications}, 305
\item {\tt sv\_getview}, 85, 355
\item {\tt sv\_monitor}, 114, 355
\item {\tt sv\_monitor\_cancel}, 115, 355
\item {\tt sv\_watch}, 117, 355
\item {\tt sv\_watch\_cancel}, 117, 355
\item {\tt sview}, 353
\item {{\tt sview} type definition}, 85
\item {synchronization}
\subitem {by token passing}, 211
\subitem {locks}, 219
\subitem {token tool}, 218
\indexspace
\item {\tt t\_fork}, 357
\subitem {t\_fork\_urgent}, 157
\item {\tt t\_fork\_msg}, 357
\item {\tt t\_fork\_urgent}, 161, 357
\item {\tt t\_holder}, 219
\item {\tt t\_on\_sys\_stack}, 163, 357
\item {\tt t\_pass}, 219
\item {\tt t\_request}, 218
\item {\tt t\_scheck}, 157, 357
\item {\tt t\_set\_stacksize}, 357
\item {\tt t\_setstacksize}, 163
\item {\tt t\_sig}, 158, 357
\item {\tt t\_sig\_all}, 158, 357
\item {\tt t\_sig\_urgent}, 158, 161, 357
\item {\tt t\_wait}, 158, 357
\item {\tt t\_wait\_l}, 357
\item {\tt t\_wait\_l()}, 319
\item {\tt t\_waiting}, 159
\item {\tt TAKEOVER}, 355
\item {tasks}, 5, 153, 356
\subitem {blocking}, 5
\subitem {blocking system calls}, 155
\subitem {calling routines that might overflow the stack}, 163
\subitem {creation}, 157
\subitem {example of bug due to non-reentrant code}, 164
\subitem {limits on stack size}, 5
\subitem {non-preemptive scheduling}, 155, 160
\subitem {recognizing infinite loops}, 156
\subitem {{\tt sleep} redefined}, 155
\subitem {stack limit}, 157, 162
\subitem {stack limit in}, 6
\subitem {symbolic names}, 158
\subitem {{\tt t\_wait}, {\tt t\_wait\_l} and {\tt t\_sig}}, 158
\subitem {urgent fork and signal}, 161
\item {TCP}
\subitem {making a connection}, 302
\item {thrashing}
\subitem {causes of}, 326
\subitem {effect on ISIS performance}, 326
\item {time-card service}, 1
\item {\tt tk\_connect}, 125, 302, 357
\item {token tool}, 218
\item {transactions}, 225, 358
\subitem {bank demo}, 365
\subitem {with home-brew transactional files}, 232
\item {twenty questions demo}, 363
\indexspace
\item {UDP port number already in use}, 341
\indexspace
\item {{\tt vfork} within ISIS applications}, 299
\item {virtual synchrony}
\subitem {atomicity of broadcast deliveries}, 178
\subitem {definition}, 26
\subitem {global event ordering}, 24, 177
\subitem {virtual failures}, 178
\item {{\tt vmstat} contrasted with {\tt prstat}}, 326
\indexspace
\item {\tt W\_FAIL}, 117, 354
\item {\tt W\_JOIN}, 118, 354
\item {\tt W\_LEAVE}, 118, 354
\item {\tt W\_RECOVER}, 117, 354
\item {watch}, 353
\indexspace
\item {\tt x\_abort}, 226, 359
\item {\tt x\_begin}, 226, 359
\item {\tt x\_commit}, 226, 359
\item {\tt x\_getid}, 227, 359
\item {\tt x\_id}, 358
\item {\tt x\_item}, 358
\item {\tt x\_list}, 358
\item {\tt x\_outcomes}, 229, 359
\item {\tt x\_outcomes\_flush}, 230, 359
\item {\tt x\_term}, 228, 359
\item {XDR encodings}
\subitem {used within ISIS}, 150
\item {\tt xdr\_getpos}
\subitem {size of an XDR data object}, 150
\item {\tt xdr\_resize}
\subitem {change size of an XDR data object}, 150
\item {\tt xdrmem}
\subitem {XDR stream used with memory object}, 150
\item {\tt xfer\_flush}, 122, 356
\item {\tt xfer\_out}, 121, 356
\item {\tt xfer\_refuse}, 122
\item {{\tt xfer\_refused}}, 107
\item {\tt xid\_cmp}, 227, 359
\item {\tt xmgr}, 360
\indexspace
\item {{\tt Z} option to {\tt isis}}, 343
\end{theindex}
\documentstyle[11pt,twoside]{report}
\makeindex
\title{The ISIS System Manual, Version 1.1}
\author{K. Birman, R. Cooper, T. Joseph, K. Kane and F. Schmuck \\ \\
\copyright 1988 by The ISIS Project \footnote
{The ISIS system was developed with support from the Department of Defense
Advanced Research Projects Agency, DARPA, under ARPA order 5378, contract MDA903-85-C-0124 and
ARPA order 6037, contract N00140-87-C-8904. The information contained in this manual
is unclassified.
}
}
\newcommand{\newtextwidth}[3]{
\setlength{\textwidth}{#1}
\setlength{\columnwidth}{#1}
\setlength{\hsize}{#1}
\setlength{\linewidth}{#1}
\setlength{\oddsidemargin}{#2}
\setlength{\evensidemargin}{#3}
}
\newcommand{\hrf}{\ \hrulefill\ }
\newcommand{\Marginpar}[1]{\marginpar[\raggedright #1]{\raggedleft #1}}
\begin{document}
\chapter{Advanced Facilities}
\markboth
{\hrf \rm Advanced Facilities\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Advanced Facilities\hrf}
\section{Using ISIS in an X-windows program}
\input{xwindows.tex}
\end{document}
The basic process group mechanisms work well provided that the size of a group
is kept moderately small. In the present version of ISIS, this means that a process
\index{process groups, large}\index{hierarchical process groups}
group of 8, 10, and perhaps as many as 20 processes will work without problems,
and as ISIS is tuned, the limit will certainly rise.
But, what if an application requires much larger numbers of processes in a group,
say 100 or 500?
Such applications are best dealt with by turning to a hierarchical design methodology,
in which the group is actually implemented by a group of {\em root-processes}
each of which represents some group of {\em node-processes}.
One way to solve this sort of problem is to create a large parent group, but
subdivide it into subgroups within which the work actually takes place.
This is only viable if broadcasts to the entire set of processes are infrequent.
If you do undertake to implement a mechanism like this,
and it involves dynamic creation or deletion of subgroups of the parent
group, be sure to use {\tt pg\_subgroup} and {\tt pg\_delete}
rather than having all the members do {\tt pg\_join} and {\tt pg\_leave}
requests in parallel. The cost of a single {\tt pg\_subgroup} operation\index{\tt pg\_subgroup}
is {\em much} smaller than for a lot of {\tt pg\_join} operations done
in parallel, and your application will perform noticeably better.
{\tt pg\_subgroup} is quite fast, and can be used fairly casually.
An alternative approach is to pick a single process as the representative from each
subgroup to the root group.
The node-processes would be organized into sets of 10 or 20 members, electing one
among their members to be the delegate to the root-process group, which would itself
have on the order of 10 to 20 members.
The idea is that requests to the entire membership would be transmitted first to the
root group, and then the members of that group would pass the request down to their
respective node-process groups for local action.
How can this scheme be made fault-tolerant?
There are two aspects to this question.
First, one needs to make sure that a node-process group always has a delegate
to the root-group.
This is easily arranged by simply having the node-processes monitor their
group membership.
If a process discovers itself to be the lowest-ranked member, e.g. \\
{\centering {\tt if(addr\_ismine(\&gview->gv\_members[0])) ...}} \\
then it joins the root group.
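A rough sketch of this election, assuming a monitor routine registered with
{\tt pg\_monitor}; the routine name {\tt node\_view}, the group name
{\tt "root-group"}, and the group address {\tt node\_gaddr} are hypothetical:
\begin{verbatim}
    void node_view(gv, arg)
      groupview *gv;
      int arg;
      {
        /* The lowest-ranked member acts as delegate to the root group */
        if(addr_ismine(&gv->gv_members[0]))
            pg_join("root-group", 0);
      }

    ...
    pg_monitor(node_gaddr, node_view, 0);
\end{verbatim}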
The second problem is a little more tricky, namely to ensure that all requests to
the root group will be propagated to the node processes regardless of the failure of
one or more root processes.
A simple alternative is for the root processes to maintain a log of recent
requests that they have seen and to use state transfer to ship this log to any
member that joins.
Since broadcasts to the node processes are atomic, the joining node-process can
simply sift through this list, discarding any requests that have already been forwarded
to it.
It should then forward any remaining requests; these are ones that the root process
failed to forward because it was down when they arrived.
A more complex scheme is to use a coordinator-cohort computation to implement the forwarding
of requests.
For example, the root processes could back up a single member of their group, which would
form an address list including all the node-process groups and broadcast requests to it.
The disadvantage here is that the coordinator must never broadcast to
more than {\tt MAX\_PROCS} processes at one time, which may force it to do multiple\index{\tt MAX\_PROCS}
broadcasts to avoid exceeding this limit (currently, 128).
More important, the approach would be potentially costly, since all requests would
now be vectored through a single primary process.
A more complex solution is to have each root-process broadcast to its node-group,
as above, but to use the others as cohorts for that broadcast, thus giving
fault-tolerance and also sharing the work.
This solution is quite satisfactory, but tricky to implement.
Total failures of node-groups or the root group are also potentially hard to handle.
If the root group goes down, it presumably becomes temporarily unavailable and
can later recover from logs by using {\tt PG\_LOGGED} as an option to the
{\tt pg\_join} operation.
If a node group goes down and others remain up, however, it would seem necessary to
spool requests for it, replaying them on recovery.
This function would have to be implemented by the members of the root process group.
As the above discussion makes clear, there are certain fairly stylized ways to
implement hierarchical process groups.
Eventually, we believe that a hierarchical group tool can be built that will
encapsulate these functions, while also providing mechanisms for such common
aspects as maintaining replicated data, dealing with network partitioning in a
hierarchical group context, and so forth.
The current scheme, as outlined above, lends itself to solutions of simple
but large applications.
Large complex applications with demanding consistency and reconfiguration constraints
can be solved using these methods, but doing so
requires a high level of programmer sophistication and experience with ISIS.
For the present, most programmers should probably
treat such large applications as being beyond the scope of the initial ISIS system.
This describes the initial version of the Common Lisp interface to ISIS.
It is currently available only from Allegro Common Lisp, although other
Common Lisps that allow calls into C code may be supported later.
The ISIS interface is provided as a package (and a module) called 'isis.
This interface is experimental and may change significantly before the final
release. We welcome comments and suggestions from lisp experts!
\section{Installation}
Read in the release tape (this is a complete ISIS release). "cd"
to the SUN3 or SUN4 directory to build the Allegro ISIS interface for
either the Sun 3 or Sun 4 architecture respectively. Edit the makefile
(either SUN3/makefile or SUN4/makefile) and change the ALLEGRO\_DIR macro to
point to the Allegro directory on your system to enable the file
"\$(ALLEGRO\_DIR)/lib/misc/lisp.h" to be accessed by the C compiler. Type
"make" to build the C/Fortran ISIS system. Then type "make allegro" to
build the Allegro-specific C libraries in the sub-directory
SUN3/allegro\_clib or SUN4/allegro\_clib. Then "cd" to allegro\_clib, startup
allegro and type ":ld make-allegro" which will compile the lisp source code
in the files: isis.cl, isis-task.cl, isis-msg.cl, isis-tools.cl and
isis-c-load.cl. These are actually symbolic links to the clib source
directory.
You should add SUN3/allegro\_clib or SUN4/allegro\_clib to your load/require
path. To use ISIS from Allegro just (require 'isis) or (load 'isis). This
should take about 2 minutes. Most of the time is spent loading in the
foreign libraries libisis1.a, libisis2.a and libisism.a.
NOTE: Make sure you type (isis-init {\em your-port-no}) followed by
(isis-start-done) to ensure that the ISIS world is initialized properly.
\section{Argument passing conventions}
There is a straightforward convention for calling most ISIS functions from
lisp. The lisp name of an ISIS function is derived from the C name by
replacing underscores by minus signs. Most scalar C types have simple
equivalents in lisp. Vector C types have corresponding simple-vector lisp
types. Since vectors of addresses or messages are so common in ISIS
programs, the ISIS lisp interface presents these as lists of address or
message pointers in most functions. The general conventions are summarized
in the following table:
\begin{verbatim}
C type Corresponding lisp type
long or int fixnum (which has a shorter range than C ints)
short fixnum
char fixnum (except for exceptions noted below)
char * simple-string
long [] (simple-array (signed-byte 32) *))
short [] (simple-array (signed-byte 16) *))
char [] (simple-array (signed-byte 8) *))
(except for exceptions noted below)
address * integer (actually a pointer to a cstruct)
message * integer (actually a pointer to a cstruct)
condition * integer (actually a pointer to a cstruct)
address [] list (of pointers to address cstructs)
int (*)() lisp function object (the ISIS lisp interface
calls ff:register-function on such arguments
before passing them to C)
\end{verbatim}
The result from most ISIS functions is an integer with zero or positive
indicating successful return and -1 indicating an error. This behaviour
is retained for most of these functions (with exceptions noted below).
Some ISIS functions return a C truth value (0 for FALSE and positive
[typically 1] for TRUE). In lisp these become nil and t respectively.
A fixnum does not have enough precision to cover the full range of a C long,
and there is a potential loss of precision going from fixnum to short or char
(in fact I'm not sure that short or char by-value arguments work at all). Arrays do not
lose precision because of an artifact of the Allegro implementation which
allows full 32, 16 and 8 bit values to be stored satisfactorily in lisp
arrays.
Addresses in lisp are always integers that point to malloc'ed cstructs in
C space. Once created, the address objects these pointers refer to
should not be freed explicitly by the lisp programmer.
Fields of the address struct can be accessed using address-entry,
address-site-id, etc. as expected.
Groupviews in lisp are pointers into C space cstructs. They should not
be freed by the lisp programmer either. Accessors for useful fields
of groupview are provided. Note that groupview-members and groupview-clients
return lists (not vectors) of addresses.
Siteviews are handled much the same as groupviews. They should not be freed.
Conditions in lisp are pointers into C space, although there
are no accessor functions. They can be freed by calling isis:free.
Messages in lisp are also pointers into C space. They can be deleted
using isis:msg-delete just as from C programs.
Although user-defined call-back functions are registered using
ff:register-function, user-defined argument objects are NOT registered
using ff:register-value. You have to do this yourself if the object might
move during garbage collections. The ISIS interface provides two functions,
object-to-id and id-to-object, that are more convenient for this purpose.
Also note that user call-back functions are called with one argument
descriptor object through which the C arguments must be accessed using
ff:foreign-argument, or by defining the call-back function using
ff:defun-c-callable.
\section{ISIS Functions}
This is a list of all the functions exported by the 'isis
package, and a synopsis of their arguments and results. If the only result
from a function is a fixnum return code it is not described. If the function
does not correspond closely to one in the current ISIS manual,
a short description is given.
\begin{verbatim}
Function Arguments Result
isis-init (port-nr)
isis-start-done ()
address (site incarn process entry) integer
address-process (addr) fixnum
address-groupid (addr) fixnum
address-port-no (addr) fixnum
address-cluster (addr) fixnum
address-site (addr) fixnum
address-incarn (addr) fixnum
address-entry (addr) fixnum
address-type (addr) fixnum
save-address (addr) integer
"Make the address argument permanent, if necessary
by making a copy of it. Useful if the address originally
came from a groupview or a message as this would
disappear when the groupview or message was deleted."
my-address () integer
my-process-id () fixnum
my-site-id () fixnum
my-site-incarn () fixnum
my-site-no () fixnum
my-host () string
site-name () string
nulladdress () integer
addr-isequal (a1 a2) (or t nil)
addr-isnull (a1) (or t nil)
addr-ismine (a1) (or t nil)
addr-cmp (addr1 addr2) fixnum
alist-len (addr-vector) fixnum
"addr-vector is a null terminated C vector of addresses.
Result is its length"
condition () integer
t-sig (cond value)
"The value IS registered using object-to-id"
t-sig-all (cond value)
"The value IS registered using object-to-id"
t-sig-urgent (cond value)
"The value IS registered using object-to-id"
t-wait (condition &optional (why "")) value
"The returned value is retrieved using id-to-object"
t-fork (function &optional (arg 0))
t-fork-urgent (function &optional (arg 0))
t-fork-msg (function msg)
t-fork-urgent-msg (function msg)
isis-task (handler-function name)
"handler is a lisp function object. The ISIS lisp interface
will register the function using ff:register-function.
name is a simple-string"
isis-entry (entry-no handler-function &optional name)
"name is a simple-string"
defun-isis-entry (function-name entry-no (msg-arg) &body body)
"Experimental macro based on defun-c-callable for
defining isis-entry routines. If successful macros
for all ISIS call-backs could be defined."
isis-mainloop (main-function &optional (arg 0))
"NOTE that you should ff:register-value or
isis:object-to-id the optional argument to avoid problems
with it moving upon garbage collections"
isis-input (file-des input-function &optional (arg 0))
isis-input-sig (file-des cond value)
"The value IS registered using object-to-id"
isis-signal (sig-no signal-function &optional (arg 0))
isis-accept-events (flag)
isis-timeout (milliseconds function &optional (arg 0))
sleep (seconds)
isis-sleep-ms (ms)
t-suspend ()
msg-copy (msg) integer
msg-delete (msg)
msg-newmsg () integer
msg-getlen (msg) fixnum
msg-getreplyto (msg) integer
msg-getsender (msg) integer
msg-gettruesender (msg) integer
msg-getdests (msg) (list integer)
msg-setdest (msg addr)
msg-setdests (msg addr-list)
msg-isforwarded (msg) (or t nil)
msg-ismsg (msg) (or t nil)
msg-msgcheck (msg) (or t nil)
message (&rest arg-list) integer
"Constructs a new message and inserts each argument
into it using msg-put-any. Each argument is either
a single value or a (value type) pair."
msg-put-any (msg value &optional field-type) integer
"Inserts value into msg using a field type specified
by field-type. If field-type is omitted it is inferred from the type
of value. The result is msg.
The following table describes the correspondence between field types,
lisp value types, and C format letters. If the field-type is specified
it must be one of the atoms in the first column. If it is omitted
the type of the value must be in the second column. In both cases
the field inserted in the message corresponds to the C format letter(s)
given in the third column.
Field type (type-of value) Format letters (as used from C)
'long fixnum or long "l" or "d"
'short short "h"
'byte byte-8 "c"
'char char "c"
'string string "s"
'address "A[1]"
'message "m"
'bitvec "B[1]"
'event "E[1]"
'groupview "P[1]"
'long-vector long-vector "L" or "D"
'short-vector short-vector "H"
'byte-vector byte-vector "C"
'address-list (list integer) "A"
Note that a list of integers is interpreted as a list of address pointers.
This is a dangerous assumption and may be removed if problems are caused."
message-list (msg &rest type-list-or-number) list
"If the second argument is a number, message-list
extracts that many arguments using msg-any.
Otherwise the second argument is a list of types,
and msg-any is called once for each type extracting
a value of that type from the message.
The result is a list of these extracted values."
msg-any (msg &optional type) lisp
"Extracts the next field from msg. If type is specified the field must
be compatible with that type. If type is omitted a type based on
the message field is used as follows:
Format letters (used in C code) default Lisp type
"c" or "s" string
"h" short
"l" "d" "D" "L" long
"a" "A" address
"m" message
"b" "B" bitvec
"p" "P" groupview
"e" "E" event
Note that no distinction is made between vectors and individual values
in the field types. A later version of msg-any may choose to return a
vector of values when the number of elements in the field is more
than one. (In any case a vector of one element and an individual value
are represented identically in messages)."
msg-put-long (msg value) integer
"Inserts value as a long field in msg.
Result is msg"
msg-put-short (msg value) integer
msg-put-byte (msg value) integer
msg-put-char (msg value) integer
msg-put-string (msg value) integer
msg-put-address (msg value) integer
msg-put-message (msg value) integer
msg-put-long-vector (msg vector) integer
msg-put-short-vector (msg vector) integer
msg-put-byte-vector (msg vector) integer
msg-put-address-list (msg addr-list) integer
msg-put-bitvec (msg value) integer
msg-put-event (msg value) integer
msg-put-groupview (msg value) integer
msg-long (msg) integer
"Extracts the next field from msg as a long"
msg-short (msg) fixnum
msg-byte (msg) fixnum
msg-char (msg) string-char
msg-string (msg) simple-string
msg-address (msg) integer
msg-message (msg) integer
msg-bitvec (msg) integer
msg-event (msg) integer
msg-groupview (msg) integer
msg-long-vector (msg) long-vector
msg-short-vector (msg) short-vector
msg-byte-vector (msg) byte-vector
msg-address-list (msg) (list integer)
"Result is a list of address pointers"
msg-getscantype (msg) fixnum
"Returns the field-type of the field which will be
extracted by the next msg-get-any"
msg-increfcount (msg)
msg-gettype (msg field-name position) fixnum
"Returns the field-type of the specified instance of
field field-name"
msg-printfield (msg field-type position)
"2nd arg is actually an unsigned-byte!"
msg-rewind (msg)
msg-fread (file-pointer) integer
"file-pointer must be a pointer to a cstruct FILE"
msg-fwrite (file-pointer msg)
msg-read (file-desc) integer
msg-write (file-desc msg)
bcast (dest msg
&key ((:replies n-wanted) 0)
((:exclude-sender exclude))
((:lazy lazy)))
"dest is a list specifying the destination of the atomic broadcast.
it is in one of the following forms:
( addr [ entry ] ) -- single address and optional entry number
( ( addr+ ) [ entry ] ) -- address list and optional entry number
msg is the message to be sent.
:replies n-wanted may be a non-zero integer or one of 'all or 'majority.
:exclude-sender and :lazy are booleans
The result is a list of the received messages if the broadcast was
successful (nil if no replies), or -1 if the broadcast failed. (-1 is
kind of non-lisp-ish!)"
abcast (...)
"Atomic broadcast, (same as bcast)"
cbcast (...)
"Causal broadcast"
gbcast (...)
"Group broadcast"
fbcast (...)
"FIFO broadcast"
reply (received-msg reply-msg
&key ((:cc addr-list))
((:exclude-sender exclude)))
":cc addr-list is an optional list of destinations to
receive copies of the reply.
:exclude-sender is a boolean"
abortreply (msg)
nullreply (msg)
flush ()
forward (in-msg addr entry out-msg)
simple-abcast (...) fixnum
"The regular C abcast function. Pretty hard to call from
lisp because you have to get the varargs right!"
simple-cbcast (...) fixnum
simple-fbcast (...) fixnum
simple-gbcast (...) fixnum
simple-bcast (...) fixnum
simple-reply (...)
pg-join (gname
&key ((:dontcreate dontcreate)) ; t or nil
:logged (fname replay-entry-no end-replay-function)
:init (init-function)
:monitor (monitor-function &optional arg)
:xfer (domain-no send-function receive-function)
:bigxfer bigxfer ; t or nil
:credentials credentials-string
:join-authen join-authen-function
:client-authen client-authen-function)
pg-subgroup (pgaddr sgname incarn members clients)
"members and clients are lists of addresses"
pg-join-inhibit (flag)
"flag is a lisp boolean"
pg-getlocalview (gaddr) integer
pg-getview (gaddr) integer
pg-local-lookup (gname) integer
pg-rank (gaddr paddr) fixnum
pg-rank-all (gaddr paddr) fixnum
pg-addclient (gaddr paddr)
pg-delclient (gaddr paddr)
pg-delete (gaddr)
pg-leave (gaddr)
pg-lookup (gname) integer
pg-monitor (gaddr monitor-function &optional (arg 0))
pg-monitor-cancel (pwid)
pg-signal (gaddr sig-no)
pg-unmonitor (gaddr)
pg-client (gaddr credentials-string)
pg-watch (gaddr paddr event watch-func &optional (arg 0)) fixnum
"event is one of 'join or 'leave
Result is watch-id"
proc-watch (paddr watch-func &optional (arg 0)) fixnum
"Result is watch-id"
pg-watch-cancel (wid)
proc-watch-cancel (wid)
xfer-flush ()
xfer-out (locator msg)
allow-xfers-xd (gname domain-no send-function)
groupview-viewid (gview) fixnum
groupview-incarn (gview) fixnum
groupview-flag (gview) fixnum
groupview-gaddr (gview) integer
groupview-joined (gview) integer
groupview-departed (gview) integer
groupview-name (gview) simple-string
groupview-members (gview) (list integer)
"The result is a list of address pointers"
groupview-clients (gview) (list integer)
sv-monitor (monitor-function &optional (arg 0)) fixnum
"Result is monitor-id"
sv-watch (site-id event watch-function &optional (arg 0)) fixnum
"event is one of 'fail or 'recover
Result is watch-id"
sv-monitor-cancel (mid)
sv-watch-cancel (wid)
site-getview () integer
"Result is a pointer to the siteview, which can be
accessed by the following functions"
siteview-viewid (sview) fixnum
siteview-slist (sview) integer
"Result is a pointer to an unsigned short C vector.
Access using c-short-vector-ref."
siteview-incarn (sview) integer
"Access result using c-byte-vector-ref."
siteview-failed (sview) integer
"Result is a pointer to an ISIS/C bitvec.
We need to add access functions for these bitvecs."
siteview-recovered (sview) integer
"Result is a pointer to an ISIS/C bitvec."
isis-rexec (n-wanted gaddr sites prog args env user passwd)
"sites is a list of addresses,
args and env are (vector simple-string)'s
and user and passwd are simple-strings.
Result is a list of addresses where the programs were
started up"
news-apost (site-vector subject msg back)
"site-vector is a null-terminated C site-id vector"
news-cancel (subject)
news-clear (site-vector subject)
news-clear-all (site-vector subject)
news-post (site-vector subject msg back)
news-subscribe (subject entry-no back)
coord-cohort (msg gaddr action-func got-reply-func &optional choose-func)
cc-terminate (msg)
"Unlike the C version, this takes a single message"
isis-logentry (gaddr entry-no)
log-checkpoint (gname)
log-flush (gaddr)
log-write (gaddr msg)
log-recovered (gaddr) (or t nil)
rmgr-register (key-string)
rmgr-unregister ()
rmgr-update (key-string prog-name arg-vector env-vector)
"arg-vector and env-vector are (vector simple-string)'s"
x-abort ()
x-begin ()
x-commit (phase)
x-term (participant-name
on-prepare-func on-commit-func on-abort-func msg)
x-getid () integer
"Returns a transaction id pointer (which is exactly
the same as an address pointer)."
x-outcomes (participant-name) integer
"Returns a pointer to an xlist cstruct which can be
accessed by the following functions."
x-list-len
x-list-items-id (xlist index) integer
"Result is pointer to x-item cstruct at given index"
x-item-id (xitem) integer
"Doesn't work"
x-item-outcome (xitem) fixnum
x-item-info (xitem) integer
x-outcomes-flush (participant-name x-list-ptr)
t-holder (gaddr name) integer
t-pass (gaddr name)
t-request (gaddr name pass-on-fail)
"pass-on-fail is a lisp boolean"
paddr (addr)
pmsg (msg)
pentry (addr entry-no)
"Prints the address/entry-no pair symbolically"
pr-dump (dump-level)
isis-print (...)
"ISIS print function."
isis-perror (message)
"A simple-string message only, with no following varargs."
free (pointer)
"Pointer is an integer address into C space.
Don't call free on address or message pointers.
Don't call free on groupview or siteview pointers.
Actually unless you're experiencing serious memory
leaks in a long-lived program don't call this at all!"
The following types and accompanying constructors and accessors have been
defined for use with the message insert and extract functions:
(deftype long () '(signed-byte 32))
(deftype long-vector (&optional n) `(simple-array long (,n)))
make-long-vector (size) long-vector
make-long () integer
long-value (l) integer
\end{verbatim}
Similarly for short, short-vector, byte-8, and byte-vector.
These are lisp-space garbage-collected objects which are safe to pass
to the message functions (during which garbage collections cannot occur),
but not to most of the other isis functions, which may block or call back
into lisp.
There are also some functions which manipulate C-space vectors and
scalar variables. These are safe to pass to blocking ISIS functions,
and are used internally by the ISIS lisp interface. You should have
little need to use these for any of the above ISIS functions. But if
you modify the lisp interface, or you are writing your own foreign-function
code, you might have use for them. See the file isis-msg.cl for details
if these are useful to you.
\section{Shortcomings with this release}
It is rather easy for the lisp user/programmer to type an incorrect ISIS
call which will cause a segment violation or similar. This is unlikely to
destroy the lisp world, but is very likely to destroy ISIS for that lisp
session. Even if no errors occur, the user may wish to restart an ISIS
computation several times within the same lisp session. This is really a
mismatch between C and lisp usage: C programs are basically unsafe but
easy and quick to restart, while lisp sessions detect and report
almost all user/programmer errors but are long-lived and slow to restart.
When an ISIS/C error occurs, or if an ISIS computation is to be run
again, the lisp user must end the lisp session and begin again.
dumplisp can be of help here. Note that it will not work to load ISIS
again since foreign code (i.e. C code) cannot be re-loaded.
Long-term solutions to this include allowing ISIS to reinitialize itself
by resetting all the C variables, clearing multi-tasking locks, and
informing the rest of the distributed ISIS environment that this
computation has terminated and a new incarnation is beginning. This would
not, of course, recover from serious corruption of C memory. We may also
permit individual lisp tasks to be killed or frames to be unwound.
In the current release, unwinding an individual stack which contains ISIS
frames will not release locks and may cause ISIS to deadlock for that
lisp session.
Addresses and messages both appear as integers (really pointers into C space).
This is non-intuitive to the lisp user, and it would be better if they
were presented as a new type which encapsulates the C pointer. In the
meantime, if you cannot remember whether a large integer value, x, really was
an address or a message, call (paddr x) or (pmsg x). Whichever of these
gives plausible output is a hint to the type of value pointed to.
The performance is worse than we had hoped. Lisp "lightweight" task/process
switches take about 1 or 2 milliseconds. ISIS uses tasks a lot, on the
assumption that they are fairly cheap. We may optimize some task switches
away in a later release, but most are fundamental to the way the system
operates. We have encountered some bugs in the Allegro tasking mechanism,
and working around them has cost something in performance as well. When
these problems are resolved with Franz Inc. we expect performance to
improve.
% logex1.tex
\begin{verbatim}
#include "isis.h"
#define TYPE_A_ENTRY 1
#define TYPE_B_ENTRY 2
#define QUERY_ENTRY 3
main()
{
int service_maintask(), rec_type_A_update();
int rec_type_B_update(), rec_query();
isis_init();
/* Declare tasks and entry points */
isis_task(service_maintask, "service_maintask");
isis_entry(TYPE_A_ENTRY, rec_type_A_update,
"rec_type_A_update");
isis_entry(TYPE_B_ENTRY, rec_type_B_update,
"rec_type_B_update");
isis_entry(QUERY_ENTRY, rec_query, "rec_query");
isis_mainloop(service_maintask);
}
service_maintask()
{
address *gaddr;
/* Join the server group and setup logging */
gaddr = pg_join("sample_group",
PG_INIT,
PG_LOGGED, , 0,
L_AUTO, NULL,
PG_XFER, 1, ,
,
0);
/* Log the entry points that receive updates */
isis_logentry(gaddr, TYPE_A_ENTRY);
isis_logentry(gaddr, TYPE_B_ENTRY);
isis_start_done();
}
rec_type_A_update(msg_p) /* Receive a type A update */
message *msg_p;
{
/* Process a type A update */
.
.
.
}
rec_type_B_update(msg_p) /* Receive a type B update */
message *msg_p;
{
/* Process a type B update */
.
.
.
}
rec_query(msg_p) /* Receive a query */
message *msg_p;
{
/* Process a query */
.
.
.
}
\end{verbatim}
% logex2-1.tex
\begin{verbatim}
gaddr = pg_join("sample_group",
PG_INIT,
PG_LOGGED, , 0,
L_MANUAL, NULL,
PG_XFER, 1, ,
,
0);
\end{verbatim}
% logex2-2.tex
\begin{verbatim}
gaddr = pg_lookup("sample_group");
.
.
/* Begin sequence of updates */
update1
update2
.
.
/* End sequence of updates */
log_flush(gaddr);
\end{verbatim}
% logex3.tex
\begin{verbatim}
#include "isis.h"
#define TYPE_A_ENTRY 1
#define TYPE_B_ENTRY 2
#define QUERY_ENTRY 3
#define REC_REPLAY 4
static address *gaddr; /* Sample group address */
main()
{
int service_maintask(), rec_type_A_update();
int rec_type_B_update(), rec_query();
int group_change(), rec_replay_msg();
isis_init();
/* Declare tasks and entry points */
isis_task(service_maintask, "service_maintask");
isis_task(group_change, "group_change");
isis_entry(TYPE_A_ENTRY, rec_type_A_update,
"rec_type_A_update");
isis_entry(TYPE_B_ENTRY, rec_type_B_update,
"rec_type_B_update");
isis_entry(QUERY_ENTRY, rec_query, "rec_query");
isis_entry(REC_REPLAY, rec_replay_msg,
"rec_replay_msg");
isis_mainloop(service_maintask);
}
service_maintask()
{
int group_change(), end_replay();
/* Join the server group and setup logging */
gaddr = pg_join("sample_group",
PG_INIT,
PG_LOGGED, ,
REC_REPLAY, L_AUTO,
end_replay,
PG_XFER, 1, ,
,
0);
pg_monitor(gaddr, group_change, 0);
isis_start_done();
}
group_change(gview_p, arg)
groupview *gview_p;
int arg;
{
message *log_msg;
/* Process the change in group membership */
.
.
.
/* Log the state change due to the new view */
log_msg = msg_gen( );
log_write(gaddr, log_msg);
msg_delete(log_msg);
}
rec_type_A_update(msg_p) /* Receive a type A update */
message *msg_p;
{
message *log_msg;
/* Process a type A update */
.
.
.
/* Log the new update */
log_msg = msg_gen( );
log_write(gaddr, log_msg);
msg_delete(log_msg);
}
rec_type_B_update(msg_p) /* Receive a type B update */
message *msg_p;
{
message *log_msg;
/* Process a type B update */
.
.
.
/* Log the new update */
log_msg = msg_gen( );
log_write(gaddr, log_msg);
msg_delete(log_msg);
}
rec_query(msg_p) /* Receive a query */
message *msg_p;
{
/* Process a query */
.
.
.
}
rec_replay_msg(msg_p) /* Receive a logged message */
message *msg_p;
{
/* Process a replayed log message by applying
the summarized changes to the process state.
Note that the message may summarize either a
type A update, type B update, or change that
occurred due to a change in group membership. */
msg_get(msg_p, );
Apply to process state;
}
end_replay() /* End-of-replay processing */
{
/* Update the recovered state to reflect the new
group view that exists after recovery */
.
.
/* Make sure the cleanup actions are logged */
log_checkpoint("sample_group");
}
\end{verbatim}
% logtool.tex
\label{Ch:logging}
\label{Ch:wide}
It is often desirable for an application to be able to recover from the
\index{{\tt pg\_join}, logging feature}\index{\em logging, activated from {\tt pg\_join}}
failure of some or all of its processes. Replicated data, as presented
earlier, is one method of achieving {\em partial} failure resiliency. By
replicating an important piece of data, an application increases the
likelihood that that piece of data will be available in the presence of
failures. However, the data will be lost in the event of a simultaneous
failure involving all the processes maintaining copies of the data. In this
chapter we present two tools, the {\em log manager} and the {\em spooler},
that each provide resiliency of
data even in the case of a {\em total} process failure.
The two tools address different problems. The log tool permits a
process at a site to recover the state it was in the last time it was
running, in the case where the group to which it belonged experienced
a total failure. The log is physically associated with the process; if
the group had 10 processes, they will each have a private log when
this tool is in use. Moreover, the log tool ``hides'' behind an
active process group: if a group is not active, any logs it created
last time it was running are just sitting passively in the ISIS log
directories for the machines at which it last ran, and will remain
passive until the next time one of those machines restarts.
A spool is used as a sort of ``filter'' between a set of processes and
a service with which they communicate {\em asynchronously}. The
spooler is an {\em active} entity: it will generally be operational at
all times, and accessible at all times, even if the group for which it
is spooling is down. This is significantly different from the
interface provided by the log tool. On the other hand, the contents
of a spool are basically a log of all the communication that was sent
to this service---that is, to a process group. The idea is to use a
spool for asynchronous communication with a group that shows up ``now
and then'' to read its spool and clear it, and to use logs to make it
possible for a process group that normally remains continuously
operational to recover back into the state it had last time it failed.
Spools are almost never used for synchronous
communication---communication with replies. Logs are often used with
groups that reply to the messages they receive.
Spools obviously have one advantage over logs, namely that they are
accessible even when the application on behalf of which they are spooling is down.
They gain this flexibility at a cost:
communication via spools is indirect and hence may be slower than when
using the logging tool, which imposes only a very minor overhead.
As an example, we use the spool facility as an interface to remote LAN's running
ISIS. In this case, the process group that sends to the remote LAN
is only running when the link is up, but the spool is always
available and hence the user who wants to send to a remote LAN need
not worry about whether the link is currently operational.
Thus, the designer of an application must pick between the default (no tolerance
of total failures), logging (bound to the location of the process being logged) and
spooling (location transparent and done on behalf of the whole group, but
restricted to asynchronous communication).
The spooler has a second role in ISIS: it can be used to link ISIS systems
running on multiple, separate LAN's into an integrated wide-area system.
This interface includes a way to transfer large files from machine to
machine, and is being extended to support a type of {\tt cbcast} and
{\tt abcast} protocol, optimized for long-haul communication.
The logging tool has no such ``dual existence''.
We start by discussing the logging tool, then turn to the spooler.
\section{Logging tool}
The log manager is designed for situations in which a process group is
used to replicate data. The log manager is invoked by specifying the
{\tt PG\_LOGGED} option to {\tt pg\_join} and will make the state
of the corresponding group resilient to total failure. The log manager
achieves this resiliency by periodically logging a copy (a {\em checkpoint})
of the group state onto stable storage and by logging all changes that occur
to that state. Upon recovery from total failure, a process group may use
the logged checkpoint and changes to reconstruct the state of the group as
it existed prior to the failure.
The log manager features two modes of operation. In the {\em automatic} mode,
the operation of the log manager is transparent to the application. The log
manager automatically maintains the group log and handles the group state
recovery after total failure. The automatic mode is limited, however, to
groups that are fairly well behaved, as will be explained below. When the
process group being logged is not well behaved, the log manager provides a
{\em manual} mode of operation. In the manual mode, process groups are
responsible for specifying the contents of their logs and how those contents
should be handled upon recovery. The manual mode is more widely applicable
than the automatic mode, but requires more programming effort.
In addition (slightly confusingly), the logging tool supports a separate
notion called {\em manual flushing}. Whereas the manual logging mode is
concerned with control over the data that gets put in a log, manual flushing
is concerned with control over when the log gets written.
In this latter mode, log writes are accumulated in memory until the user
explicitly forces them to be flushed, an operation which is atomic in a
database sense.
Manual logging and manual flushing are separate issues: one can use either
mechanism or both in a given application.
\subsection {The Automatic Mode}
\index{{\em logging}, automatic mode}
When the state of a process group is completely determined by the set of
incoming messages to the group, the group state may be made resilient by using
the log manager in automatic mode. In this mode, the log manager
automatically handles all aspects of logging. This includes logging
all incoming messages that change the group state, taking and logging
periodic checkpoints of the state, and recovering the group state after
a total failure. In this subsection we describe how to set up the log manager
in automatic mode. The handling of recovery, in automatic mode, is discussed
later in subsection~\ref{sec:recovery}.
\subsubsection*{Invoking automatic mode}
\index{{\tt pg\_join}, automatic logging}
The log manager is invoked in the automatic mode by specifying the
{\tt PG\_LOGGED} option to {\tt pg\_join} as shown below. All members of
a logged process group should specify {\tt PG\_LOGGED} when joining the
group. The four arguments given after {\tt PG\_LOGGED} will be explained
throughout this chapter.
\begin{verbatim}
pg_join( .
.
PG_LOGGED, fname, 0, flush_type,
end_replay,
.
.
0);
char *fname;
int flush_type;
int (*end_replay)();
\end{verbatim}
When a process invokes {\tt pg\_join} with the {\tt PG\_LOGGED} option, a log
\index{{\em logging}, log file names}
file is created for that process with the name {\tt fname}. The log file is
created in the ``logs'' sub-directory of the ISIS startup directory. Note that
by default, this directory will be /usr/spool/isis/logs. However, this default
may be overridden using the ISIS argument {\tt -d} (see the
appendix on ``Setting Up ISIS''). The log file will contain the process' local
log of the group's state changes. Each member of the group will have its own
log file for the group, so care should be taken that each member uses a
unique {\tt fname}. Unpredictable behavior will result if an
{\tt fname} is used by more than one process, or for more than one group.
{\tt pg\_join} behaves nearly the same with
{\tt PG\_LOGGED} as it does without it. With {\tt PG\_LOGGED}, however,
{\tt pg\_join} may take some special actions when the process is restarted
after a failure. These special actions are discussed later in
subsection~\ref{sec:recovery}.
In order to complete the setup of automatic logging, each member of a
logged group should specify the set of entry points at which updates
for the group state are received. For each such entry point, this is done
by invoking
\begin{verbatim}
isis_logentry(gaddr, entry)
address *gaddr;
int entry;
\end{verbatim}
\index{\tt isis\_logentry}
with the address of the logged group and the entry point number. It is
important that {\tt isis\_logentry} be called for each entry point that
receives messages which will cause the group state to change. All logged
entry points should be declared between the call to {\tt pg\_join} and
the call to {\tt isis\_start\_done}. An entry point can be logged for
only one group. If more than one group attempts to log the same entry
point unpredictable behavior will result.
Nothing else needs to be done to set up a process group for automatic
logging. Once the process group is created, it will behave just as if
logging were not being used. The log manager automatically handles all
logging functions. This includes taking periodic checkpoints of the
group state and recording all update messages received by the group.
It should
be pointed out that the log manager obtains checkpoints using the normal
state transfer mechanism. Thus, members of logged groups should specify
state transfer routines when joining logged groups (see the {\tt PG\_XFER}
option to {\tt pg\_join}). Checkpoints occur periodically throughout the
normal execution of the group and appear to the group members as normal
state transfers.
\subsubsection*{Parameters used by the log tool}
Several of the parameters used by the log manager can be modified under
program control.
The
{\tt log\_ignoreold(true/false)}
\index{\tt log\_ignoreold}
routine causes the log manager to ignore any old log contents that
may be found at restart time. This is useful if an old log is
believed to have been corrupted.
The {\tt PG\_LOGPARAMS} parameter to {\tt pg\_join} permits the
user to specify four constants used by the log manager during its
\index{{\tt PG\_LOGPARAMS}, option to {\tt pg\_join}}
run. These are specified in order: {\tt nmbuf, nmdsk, lflen, timer}.
Each is an integer, and in all cases the value 0 means that ISIS should
use the default value.
\begin{description}
\item[nmbuf]
Number of messages that can be buffered in memory before flushing the log.
\item[nmdsk]
Number of messages that can be logged before making a checkpoint.
\item[lflen]
Number of bytes of data that can be logged before making a checkpoint.
\item[timer]
Frequency (in seconds) at which checkpoints should be made.
\end{description}
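As a concrete sketch (the constants, group name, log-file name, and
state-transfer routines below are illustrative placeholders, not
recommended values), these parameters might be supplied as follows:
\begin{verbatim}
    /* Sketch only: join a logged group, overriding the log manager's
       defaults.  The four integers after PG_LOGPARAMS are, in order,
       nmbuf, nmdsk, lflen and timer; a value of 0 keeps the default. */
    log_ignoreold(0);           /* 0 = do not ignore old log contents */
    gaddr = pg_join("sample_group",
                    PG_LOGGED, "sample_log", 0, L_AUTO, NULL,
                    PG_LOGPARAMS, 50, 500, 0, 300,
                    PG_XFER, 1, send_state, rcv_state,
                    0);
\end{verbatim}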
\subsubsection*{An example of automatic logging}
Below we give an example of how to use the logging mechanism in automatic
mode. The example contains code for a sample server. The server may be
replicated by running it on several different machines. In the example,
the server receives updates of two types: type A and type B. In addition,
the server receives periodic queries. Only requests of type A and B will
affect the server's state. Thus, only these entry points are logged.
\input{logex1}
\subsection {Recovery in the Automatic Mode}
\label{sec:recovery}
Upon the recovery of an automatically logged group from a total failure,
the log manager automatically reconstructs, in the recovering members, the
state of the group as it existed prior to the failure. Actually, the
state is reconstructed in the first process that succeeds in recreating
the group. Other members receive the state later when they join the
group, via the normal state transfer mechanism. This subsection describes
exactly how the state is reconstructed and what recovering members will
see.
\subsubsection*{Restarting after a failure}
\index{{\tt pg\_join}, automatic recovery from a log}
\index{{\em logging}, recovery sequence}
As each member recovers from the failure, it will call {\tt pg\_join}
to rejoin the failed group. Depending on the state of the group and
the state of the member's log, one of several things will occur when
{\tt pg\_join} is called.
If the group already exists (i.e. it has already been
recovered), the recovering member simply joins the
group as it would any existing group, receiving the (recovered) group
state from some existing group member through the normal state transfer
mechanism.
If the group does not exist, {\tt pg\_join} will attempt to recreate it,
with the recovering process as the sole member. {\tt pg\_join} will also
invoke the log manager to reconstruct the group state, in the recovering
member, as described in the next subsubsection.
{\tt pg\_join} will fail, however, if the log of the recovering member is
out-of-date. This is signaled by the return value {\tt IE\_MUSTJOIN} from
{\tt pg\_join}. An out-of-date log can occur, for example, if the recovering
member had previously failed well before other members of the group and
therefore did not log the most recent state of the group. If {\tt pg\_join}
fails due to an out-of-date log, the
recovering member must wait for the group to be created by some other process
that {\em has} logged the most recent state of the group. The recovering
member may then rejoin the group, receiving the up-to-date state via the
normal state transfer mechanism. This can all be done by repeatedly pausing
and re-executing the call to {\tt pg\_join} until the join is allowed.
As long as the group does not exist, the calls to {\tt pg\_join} will fail
with the error {\tt IE\_MUSTJOIN}. Once the group has been recovered by
an up-to-date process, the call will succeed and the recovered state will
be transferred.
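A minimal sketch of this retry loop is given below. It assumes that a
join rejected with {\tt IE\_MUSTJOIN} yields a null group address (check
how the error is reported in your release); the group name, log-file
name, transfer routines and pause interval are placeholders.
\begin{verbatim}
    /* Sketch only: keep trying to rejoin until either this process
       recreates the group from its own log, or some member with an
       up-to-date log has recovered it.  Assumes a rejected join
       (IE_MUSTJOIN) returns a null address. */
    address *gaddr;

    do {
        gaddr = pg_join("sample_group",
                        PG_LOGGED, "sample_log", 0, L_AUTO, NULL,
                        PG_XFER, 1, send_state, rcv_state,
                        0);
        if (addr_isnull(gaddr))
            sleep(10);   /* pause, then retry (use a task-safe pause
                            if other tasks must keep running) */
    } while (addr_isnull(gaddr));
\end{verbatim}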
\subsubsection*{Reconstructing the group state}
If the process succeeds in creating the group, then its log does
contain the most recent state of the group and the log manager may
reconstruct the group state from it. To do this, the log manager first
resets the recovering member's state to the most recent
group checkpoint. This is done using the normal state transfer mechanism
in {\tt pg\_join}, so no special handling is required on the part of
the recovering member.
The log manager completes the state reconstruction after the call to
{\tt isis\_start\_done}. At this time, the log manager replays, to the
recovering member, all group update messages that occurred subsequent
to the time when the restored checkpoint was originally taken.
The log manager delivers all replayed messages exactly as they were
originally received. This includes delivering them in the same
order in which they were originally received and delivering them to the same
entry point at which they were originally received. It should be pointed
out, however, that the log manager changes the destination field of
these messages to contain only the address of the recovering member.
The recovering member should reprocess a replayed message just as
it originally processed the message. This will cause the same sequence of
updates to be applied to the checkpoint that were originally applied
to the group state after the checkpoint was taken. Note that no special
programming is necessary in the recovering member to deal with the
message replay. The log manager uses the existing mechanisms to restore
the group state. When the replay has completed, the group state should
be as it was prior to the failure.
\subsubsection*{New requests}
After the group state is restored, the recovering member may begin
processing new group requests. The log manager will ensure that no
new request is received by the recovering member until after the
replay has completed. This ensures that new requests are always
processed in an up-to-date state.
\subsubsection*{Notes on automatic recovery}
It should be pointed out that the log manager plays no part in
recovery from partial group failures. When only some of the members
of a group fail, those members may recover (and acquire the group state)
by simply joining the group.
The example given earlier is completely set up for recovery. Any checkpoints
will be handled by the normal state transfer send and receive routines. And,
all replayed messages will be processed exactly as before.
Actually, the log manager does not guarantee that a group will be
recovered in exactly the same state that was present immediately prior
to the failure. Due to efficiency considerations, the log manager does
not log messages immediately when they are received. Instead, the log
manager buffers messages when they are received and later logs them to
stable storage in sets. Failures at inopportune times may therefore
cause the log manager to lose the latest set of messages received by
the group (even if the group
has processed these messages). In these cases, the log manager will
recover the group in the state that was present at the time when it last
emptied its buffer. This should, however, be a recent state and so
most applications should not be affected. If more stringent guarantees
about the recovered state are needed, the {\tt log\_flush} function
should be considered (see subsection~\ref{sec:flushing}).
It should be pointed out that once a recovering member succeeds in
recreating the group and begins reconstructing the group state, the log
manager inhibits all other joins to the group
until the replay is completed. This ensures that no other
process can see the group state until the recovery is completed.
As we mentioned earlier, the receipt of a replayed message appears nearly
identical to its original receipt. Because of this, most
applications should not need to take any special actions to handle
the replay. However, if special actions are required to process a
replayed message, a recovering process can determine which messages are
\index{\tt log\_in\_replay}
part of the replay by testing the variable {\tt log\_in\_replay}. When
{\tt log\_in\_replay} is true, the process is currently receiving replayed
messages. When it is false, the process is receiving new messages. The
process may also determine the set of groups it recovered by querying
\begin{verbatim}
log_recovered(gaddr)
address *gaddr;
\end{verbatim}
\index{\tt log\_recovered}
with the address of a logged group. {\tt log\_recovered} will return true
if this process recovered the group state from its log, and false if not.
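For illustration only, a handler that needs replay-specific behavior
might test these as follows (the entry-point name and group address are
the placeholders used in the earlier examples):
\begin{verbatim}
    /* Sketch only: distinguish replayed updates from new ones. */
    rec_type_A_update(msg_p)
        message *msg_p;
    {
        if (log_in_replay) {
            /* Replayed message: reapply it to the group state,
               taking any special replay actions that are needed. */
        } else {
            /* New message: normal processing. */
        }
    }

    /* After isis_start_done(), a member can also ask whether it was
       the one that recovered the group state from its own log: */
    if (log_recovered(gaddr))
        printf("group state was recovered from the local log\n");
\end{verbatim}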
\subsection {End-of-Replay Processing}
\index{{\em logging}, cleanup actions}
Once a recovering group member has completed the processing of
replayed messages, its state should be the same as before the
failure. In most applications, this state should be sufficient for
the member to correctly process new requests. However, because the
group's environment has changed since the failure (group membership
has changed, the clock has changed, etc.), it is sometimes necessary
for the recovering process to take special actions in order to
bring the reconstructed state up-to-date with the current environment.
For example, the recovering process may wish to reassign tokens or
release resources held by failed members of the group. Because of this,
the log manager allows process group members to specify a special
cleanup function to perform at the end of the message replay.
The special cleanup function for a group is specified at the time
of {\tt pg\_join} using the
{\tt end\_replay} argument to {\tt PG\_LOGGED}. This argument
should be set to the address of the function to invoke at the
end of replay, or NULL if no cleanup actions are required. Note that
processes may specify one cleanup function for each logged group of
which they are a member. After a recovering process has completed
its replay phase, the log manager will invoke the {\tt end\_replay}
function for each group the process recovered. These cleanup functions
will be invoked before any new messages arrive at the recovering process
and before any other processes are allowed to join the recovered groups.
Thus, clients and joining members of the group will not see the group
state until it is brought up-to-date with the current environment. Cleanup
functions should not take any blocking actions. If they do, new messages
may arrive at the recovering process before the cleanup is completed and fail
to be logged by the log manager.
Most cleanup functions should include a call to
\begin{verbatim}
log_checkpoint(gname)
char *gname;
\end{verbatim}
\index{\tt log\_checkpoint}
immediately before their completion. This function will force a
checkpoint of the cleaned-up state to be taken by the log manager.
If this is not done, the state changes that occurred during the
cleanup action may not be recorded by the log manager. They may
therefore be omitted during future state reconstructions. Processes
should not call {\tt log\_checkpoint} at any location other than at
the end of a cleanup action. Calls elsewhere in the application may
cause synchronization errors in the log manager, resulting in incorrect
states being reconstructed after future failures.
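Putting these pieces together, a group that needs cleanup would pass its
cleanup function in the {\tt end\_replay} slot of {\tt PG\_LOGGED} and
have that function checkpoint the cleaned-up state. The sketch below
follows the sample server; the group name, log-file name, transfer
routines and cleanup details are placeholders.
\begin{verbatim}
    /* Sketch only: register an end-of-replay cleanup function. */
    int end_replay();

    gaddr = pg_join("sample_group",
                    PG_LOGGED, "sample_log", 0, L_AUTO, end_replay,
                    PG_XFER, 1, send_state, rcv_state,
                    0);

    end_replay()
    {
        /* Non-blocking cleanup: reassign tokens, release resources
           held by failed members, etc. */
        log_checkpoint("sample_group"); /* record the cleaned-up state */
    }
\end{verbatim}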
\subsection {Manual Log Flushing}
\label{sec:flushing}
\index{{\em logging}, states recovered}
As we mentioned earlier, the log manager does not log update requests
immediately when they are received. Instead, it buffers these messages
and only periodically flushes them to the log. Thus, after a total
failure, the logs of the last group members will not contain the
state of the group as it existed immediately prior to the failure
but the state of the group as it existed at the time of the last
flush. And, this will be the group state recovered by the log manager.
\subsubsection*{Automatic flush option}
Normally, when joining a logged group, the {\tt flush\_type} argument
to {\tt PG\_LOGGED} is set to {\tt L\_AUTO}, as was the case in the
example given earlier in this chapter. {\tt L\_AUTO} tells the
log manager to {\em automatically flush} the group. This means that
the log manager will automatically decide when to flush the group's
message buffer. Thus, after a total failure, the group may be recovered
in an arbitrary state, depending on when the log manager decided to flush
the group message buffer.
Although this factor does not affect the correctness of many applications,
it may have a disastrous effect on some. For example, consider an
application in which updates are submitted to a group in sequences. All
messages in a sequence are related and must be recovered atomically.
That is, for each sequence, the group should be recovered in a state that
contains all of the messages in the sequence or none of them. The
recovery of only part of a sequence leaves the group in an incomplete
state from which it does not know how to proceed. Under automatic
flushing, it is quite likely that the log manager would recover
such a group in one of its ``incomplete'' states. Thus, logging with
automatic flushing cannot be used with this type of a group.
\subsubsection*{Manual flush option}
\index{{\em logging}, controlling recoverable states}
For this reason, the log manager provides an option called
{\em manual flushing}. Manual flushing is designed to be used by
groups that can only be recovered in a limited number of states. With
manual flushing, the application can control when a logged group's
buffer is flushed, and thus in what states the group may be recovered.
Manual flushing is enabled at the time of {\tt pg\_join} by setting
the {\tt PG\_LOGGED} argument {\tt flush\_type} to {\tt L\_MANUAL}.
With manual flushing, the log manager must be told when the logged
group is in a state that would be correct to recover after a failure.
This is done by invoking
\marginpar{\em \small With manual flushing enabled, a log is flushed
only when {\tt log\_flush} is called.}
\begin{verbatim}
log_flush(gaddr)
address *gaddr;
\end{verbatim}
\index{\tt log\_flush}
with the address of the group when such a state occurs. {\tt log\_flush}
may only be called by members of the group being flushed. If it is called
by a non-member, {\tt log\_flush} will return the error code
{\tt IE\_NOTMEMB}. If it is necessary for a non-group member to decide
when to flush a logged group, the non-member can broadcast a message
to the group requesting that the flush be performed. Upon receiving such
request, the group may pick a single member (for example, the lowest ranked
member) to invoke {\tt log\_flush}.
Note that it is only necessary for one member to invoke {\tt log\_flush}.
It is not, however, an error for more than one member to call it for the same
state, although this may lead to some inefficiency.
{\em In manual flushing mode a log will never be flushed except when
{\tt log\_flush} is explicitly invoked, and will be buffered in
memory (growing without limit) if the flush primitive is not invoked
sufficiently often.}
After a total failure, the log manager will attempt to recover the group
in the state that was present at the time of the last invocation of
{\tt log\_flush}, even if that invocation was unsuccessful. If it cannot
do so (because the last invocation was unsuccessful), the log manager
will instead recover the group in the state that was present at the time
of the last successful invocation of {\tt log\_flush}.
\subsubsection*{An example of manual flushing}
As an example, consider the sample server given earlier in this chapter.
Suppose the members of the server group make updates to the group state
in sequences that must be recovered atomically. To handle this, the sample
servers would be changed to use the manual flushing option. This would be
done by replacing their calls to {\tt pg\_join} with
\input{logex2-1}
A sequence of atomically related updates may then be issued by a group member
as follows:
\input{logex2-2}
Note that these code changes will only work for update sequences that are
submitted by group members. If sequences are submitted by non-members, the
invocation of {\tt log\_flush} at the end of the update sequence would have to
be replaced by a broadcast to the group requesting that a {\tt log\_flush} be
done. Upon receiving such a request, the lowest ranked group member should
invoke {\tt log\_flush} on behalf of the requesting client.
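One way to code the receiving side is sketched below. It assumes a
flush-request entry point, that {\tt my\_address} names this process's
address, and that {\tt pg\_rank} gives the lowest-ranked member rank 0;
adjust these details to your application.
\begin{verbatim}
    /* Sketch only: a non-member broadcasts to this entry point to
       request a flush; only the lowest-ranked member performs it.
       Assumes my_address is this process's address and that the
       lowest-ranked member has rank 0. */
    rec_flush_request(msg_p)
        message *msg_p;
    {
        if (pg_rank(gaddr, &my_address) == 0)
            log_flush(gaddr);
        /* send a reply here if the requesting client waits for one */
    }
\end{verbatim}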
\subsubsection*{Notes on the use of manual flushing}
Care should be taken if more than one client can issue a set of
atomically related updates. If two clients issue sets concurrently,
it is possible that one client may complete issuing its set of
updates and invoke {\tt log\_flush} before the other client
finishes invoking its set of updates. If this occurs, it is
possible that the log manager may later recover the group in the
state that existed at the time of the first invocation of {\tt log\_flush}.
That is, the group could be recovered in a state that contains only
part of the second client's set of updates, violating the atomicity
constraint on sets of updates. Programmers should be careful that
their applications really do invoke {\tt log\_flush} only in
recoverable states. Many applications will naturally satisfy
this restriction (for example, applications in which only one client
issues updates at a time). Other applications, however, may require
explicit synchronization, such as semaphores or locks, in order to
guarantee the restriction.
The programmer must also ensure that when manual flushing is used,
joins to the logged group occur only in recoverable states. We
stated earlier that the log manager will recover a manually
flushed group in the state that existed at the time of the last
invocation of {\tt log\_flush}. It is also possible, however, for
the log manager to recover such a group in the state that existed at the
time of the last successful group join, but only if this join occurred
subsequent to the last successful invocation of {\tt log\_flush}. Thus,
applications should inhibit joins to manually flushed groups when those
groups are in non-recoverable states.
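As a rough sketch, assuming the routine {\tt pg\_join\_inhibit} (which
appears as {\tt pg-join-inhibit} in the lisp interface function list)
blocks new joins while its flag is non-zero, a member could bracket a
non-recoverable interval like this:
\begin{verbatim}
    /* Sketch only: keep joiners out while the state is not
       recoverable.  Assumes pg_join_inhibit(flag) blocks new joins
       while flag is non-zero. */
    pg_join_inhibit(1);
    /* ... issue the sequence of atomically related updates ... */
    log_flush(gaddr);        /* now the state is recoverable again */
    pg_join_inhibit(0);
\end{verbatim}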
{\tt log\_flush} may also be used with the {\em automatic flushing}
\index{{\tt log\_flush}, and automatic flushing}
option. Under automatic flushing, however, the log
manager only guarantees that some state subsequent to the last successful
invocation of {\tt log\_flush} will be recovered after a failure.
{\tt log\_flush} may therefore be used to restrict the age of the state
that can be recovered for a logged group.
It should be pointed out that the implementation of {\tt log\_flush}
uses a {\tt gbcast}. Thus, the use of {\tt log\_flush} is somewhat
expensive. Care should be taken that {\tt log\_flush} is not
invoked too often, or severe degradation of the application's
performance may result.
\subsection {Restrictions on the Automatic Mode}
\index{{\em logging}, well-behaved process groups}
Earlier we stated that the automatic mode of logging (with automatic or
manual flushing) can only be used with process groups that are well
behaved. In this subsection we define precisely what it means for a
process group to be well behaved.
\subsubsection*{Timing}
First, members of an automatically logged group must process logged
messages at the time they are received. As each group member receives
an update message, a task is started by ISIS in the member to handle
the received message. The log manager assumes that at the time this
task terminates, the processing of the update is complete; no residual
processing remains.
Care should be taken when using guarded actions, asynchronous broadcasts,
task forking and signaling, or any other asynchronous actions. When used
to handle an update request, such actions can leave residual processing
remaining well after the original update message task completes. Such
residual handling can cause synchronization problems in the recovery
manager that cause erroneous states to later be recovered for the group.
It is the responsibility of the programmer to ensure that no such
residual handling remains once an update message task has completed.
\subsubsection*{Effects of environment}
\index{{\em logging}, effects of environment}
Second, the processing of update messages by a group should be
independent of the group's environment. Replayed messages are
received in a different environment from the one in which they were
initially received. Group memberships have changed, the contents of
files may have changed, processes have failed, etc. The log manager
assumes that the processing of a replayed message has the same
effect on the group's state that the processing of the message
originally had when it was first received. Because of this, the
processing of update messages should not be affected by
environmental variables. Below we list several environmental variables
that may have changed between the original receipt of a message and its
replay. The programmer should be sure that no environmental variables
affect the processing of update messages.
\begin{enumerate}
\item
The site view may have changed since a replayed message was originally
received. Sites may have failed or recovered.
\item
Groups may have failed, recovered, or had membership changes. This also
implies that the addresses of groups may have changed. This includes the
group being recovered.
\item
Other processes may have failed or recovered. Note that this implies
that the sender field of a replayed message may contain the address of
an inactive process. It also implies that the PIDs of application
processes may have changed, including those of the recovering group
members.
\item
Group and process states may have changed. This may have resulted
from either a failure or through normal group/process evolution.
\item
The timing between the receipt of replayed messages may not be the same as
the timing between the messages when they were originally received.
Note, however, that the log manager does guarantee that all replayed
messages are received in the order in which they were originally received.
\item
A recovering process may not see the same set of messages during replay
that it originally saw. The recovering process will only see logged
messages for groups that it recreated. Non-logged messages and messages
logged for groups this process did not recover (but only joined in progress)
will be omitted from the replay sequence.
\item
Monitors and watches will not trigger the same events that the
recovering process originally saw. These events are not logged by
the log manager.
\item
The system clock will have changed.
\item
The contents of writable files may have changed.
\item
The news service may contain different postings than it originally
did. This may even include postings made by the recovering process
before its failure.
\item
Resources held by the recovering process may not be the same as when the
process originally ran.
\item
Signals may not be repeated, although new ones may occur.
\end{enumerate}
\subsubsection*{Externally visible actions}
Note that the restriction on the use of environmental variables implies
that members of well behaved groups should not process updates in a
way that is dependent on actions being taken by other processes.
In the new environment, a recovering member might not be able to count
on the same outside actions being taken that were taken in the
original environment.
In addition, well behaved group members should not process updates in a
way that causes non-idempotent actions to be viewed outside the member.
Under automatic logging, a member reprocesses replayed messages just
as it processed the messages originally. Thus, during replay, a
recovering member will repeat all externally viewable actions that
it originally took for the replayed messages. If these actions are
not idempotent, erroneous behavior may result. Note that some
externally viewable actions are always idempotent, and may therefore
be used by well behaved process groups. For example, members of logged
process groups may {\tt reply()} to replayed messages. Any broadcast
pending a reply from a failed group will terminate with only a limited
number of replies. Any future invocations of {\tt reply()} to this
terminated broadcast, including those during replay, will be ignored.
Members of well behaved groups may, however, take non-idempotent
actions in response to {\em non-logged} messages. For example, in the
sample server given earlier, the processing of queries may cause
broadcasts or other non-idempotent actions to take place. Such
actions will not be repeated during replay because queries are
not replayed.
\subsubsection*{Determinism}
The state of a well behaved group should be completely deterministic
with respect to the sequence of messages it receives at logged
entry points. The group state should not be affected by other
forms of communication such as replies or signals.
Groups that are not well behaved, and may not therefore use the
automatic logging mode, should use the manual mode of logging
(see subsection~\ref{sec:manlogging}).
\subsection {The Manual Mode}
\label{sec:manlogging}
\index{{\em logging}, manual mode}
In addition to the automatic mode of logging, the log manager provides
another mode of operation called the {\em manual} mode. The manual mode
is designed to be used by process groups that are not well behaved, as
described above, and cannot therefore use the automatic logging mode. The
manual mode is more versatile than the automatic mode but requires more
programming effort to use. With manual logging, each member of a logged
process group is responsible for specifying the exact information that
should be logged and how that information should be handled on recovery.
With this mechanism, a process group may record information about the
environment in which messages are received and processed, so that that
information is available when the log is replayed. In addition, with
the manual logging mechanism, a process group may record state changes
that occur not as a direct result of received messages.
\subsubsection*{Invoking the manual mode}
\index{{\tt pg\_join}, manual logging option}
The log manager is invoked in manual mode by specifying the {\tt PG\_LOGGED}
option to {\tt pg\_join} as shown below.
\begin{verbatim}
pg_join( .
.
PG_LOGGED, fname, replay_entry,
flush_type, end_replay,
.
.
0);
char *fname;
int replay_entry;
int flush_type;
int (*end_replay)();
\end{verbatim}
The arguments {\tt fname}, {\tt flush\_type}, and {\tt end\_replay} are
the same as before. The argument {\tt replay\_entry} is described later
on in this subsection.
The manual mode of logging behaves much the same as the automatic
mode. However, in the manual mode, each member of the group is
responsible for telling the log manager exactly what must be logged;
no entry points are automatically logged.
Whenever a change occurs to the state of a manually logged group, each
member of the affected group should call
\begin{verbatim}
log_write(gaddr, msg_p)
address *gaddr;
message *msg_p;
\end{verbatim}
\index{\tt log\_write}
with the address of the group and a message containing a
description (in any form) of the change that occurred. Group members
should call {\tt log\_write} immediately after making the specified
change to their copy of the group state; if {\tt log\_write} is called
before the state change occurs, the log manager may encounter
synchronization problems that cause it to later reconstruct incorrect
states after failures.
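The sketch below illustrates this ordering. It is hypothetical: the
{\tt update\_local\_state} helper and the message layout are invented for
the example, and the message is built and released with the usual ISIS
message routines; only the call to {\tt log\_write} itself is part of the
log manager interface.
\begin{verbatim}
/* Hypothetical sketch: apply a change, then log a description of it.
 * update_local_state() and the "%d,%s" layout are invented for this
 * example; only log_write() is part of the log manager interface.
 */
void record_update(gaddr, kind, value)
    address *gaddr;
    int kind;
    char *value;
{
    message *mp;

    update_local_state(kind, value);    /* 1. change our copy first     */
    mp = msg_newmsg();                  /* 2. describe the change       */
    msg_put(mp, "%d,%s", kind, value);
    log_write(gaddr, mp);               /* 3. then log it               */
    msg_delete(mp);                     /* release our reference        */
}
\end{verbatim}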
\subsubsection*{Recovery in manual mode}
\index{{\tt pg\_join}, recovery in manual mode}
\index{{\em logging}, manual mode recovery}
The log manager handles recovery from failure under the manual mode
much as it handles recovery under the automatic mode. When a
recovering process attempts to join a logged process group, the log
manager checks to see if the group already exists. If the group exists,
the recovering process joins the group like any other join. If the group
does not exist then the log manager will attempt to recover the group
state from the process' log. If the process' log does not contain the
most recent state for the group, {\tt pg\_join} will fail with the ISIS
error {\tt IE\_MUSTJOIN}. In this
case, the recovering process must wait for someone else to recover the group
and then it may join. If the process' log does contain the most recent
state of the group, the log manager will pass the most recent checkpoint
contained in the log to the recovering process. Again, this appears as a
normal state transfer.
The recovery differences between the automatic and manual modes show
up during replay, after the call to {\tt isis\_start\_done}.
As was the case with automatic logging, a recovering member of
a manually logged group will receive, during replay, all
messages logged subsequent to the time when the restored
checkpoint was originally taken. However, unlike with automatic
logging, these replayed messages are received by the recovering member
at the entry point specified with the {\tt replay\_entry} argument
to {\tt PG\_LOGGED}.
Replayed messages are delivered to the recovering member in the order
in which they were written with {\tt log\_write} and will contain the
exact message contents given to {\tt log\_write}. Upon receiving
a replayed message, the replay entry point should make the changes,
specified by the message, to the group state. It is the responsibility
of the programmer to make sure that manually logged messages contain
enough information for the replay entry point to duplicate the original
state change. Note that it may be necessary for logged messages to
contain such things as group views or token holder information, if
this information is necessary to duplicate a state change.
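As a rough illustration, each logged message might begin with a small
integer tag identifying the kind of update, so that the replay entry point
can dispatch on it. The tag names and the {\tt apply\_*} routines below are
invented for this sketch; the message is unpacked with {\tt msg\_get} in the
usual way.
\begin{verbatim}
#define UPD_A    1        /* invented tags for this sketch               */
#define UPD_B    2
#define UPD_VIEW 3

/* Hypothetical replay entry point: dispatch on the tag written by
 * log_write; apply_type_A(), apply_type_B() and apply_view_change()
 * are invented helpers.
 */
void replay_update(mp)
    message *mp;
{
    int tag;

    msg_get(mp, "%d", &tag);
    switch (tag) {
      case UPD_A:    apply_type_A(mp);      break;
      case UPD_B:    apply_type_B(mp);      break;
      case UPD_VIEW: apply_view_change(mp); break;
    }
}
\end{verbatim}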
\subsubsection*{Notes on manual logging}
The {\tt replay\_entry} entry point should be used solely for the purpose
of applying manually logged changes to a group member's state. This
implies that it should not take any externally visible actions. It should
only alter the recovering member's state to reflect the changes summarized
in the received replay message. All blocking actions should also be
avoided.
The {\tt log\_flush} function has the same effect on manually logged
groups that it has on automatically logged groups.
\subsubsection*{An example}
Below we give an example of the manual logging mode. The example is
almost identical to the first example given in this chapter. There are
some important differences, however. Like before, the state of the
sample server changes in response to type A and type B messages.
Unlike before, though, the state also changes in response to changes in the
group membership. The server logs each of these types of updates using a
different call to {\tt log\_write}. Note, however, that all three types
of manually logged messages are processed by the same entry point during
replay. Enough information must therefore be included in the three types
of logged messages for the receiving entry point to distinguish between
them. Please note, also, that the server includes a special
cleanup function for invocation at the end of replay. This cleanup
operation would bring the recovered state into agreement with the new
group view that is present at the end of recovery.
\input{logex3}
\subsection {General Restrictions and Notes}
Throughout this chapter we have presented many restrictions on the
use of the log manager. In this subsection we present the remaining
restrictions and some notes on the operation of the log manager.
\subsubsection*{Log names}
\index{{\em logging}, log file names}
Applications should never use log file names that match the specification
``log\_temp*''. Files with names matching this specification are used
by the log manager for temporary purposes.
\subsubsection*{Restarting after a failure}
The log manager is not responsible for restarting processes after a
failure. It is the responsibility of the application's programmer to
set up the recovery manager to do this. The log manager only aids
in the reconstruction of a group's state once members have been
restarted.
After a total failure of a logged group, only previous members of
the group should attempt to recreate it. Processes attempting to
recreate a logged group, from a site that never had a member in the
group, may accidentally succeed in recreating the group in its
initial state. The reason for this is that the log managers at
such sites do not have knowledge of the previous existence of the
group. They therefore believe that the group is being created for
the first time and allow {\tt pg\_join} to set up the initial group
state. If an application does start potential group members on
``unaware'' sites, it should prevent them from accidentally creating
the group by including the {\tt PG\_DONTCREATE} option to
{\tt pg\_join}.
It should be pointed out that a call to {\tt pg\_join} to recreate
a logged process group may suffer a significant delay. In order to
determine if the invoking member's log is up-to-date, the log manager
may need to contact several sites on which previous group members
existed. If any of these sites are down, the log manager must wait
for them to recover in order to determine if the invoking member's
log can be used for recovery. {\tt pg\_join} will block until the
log manager can determine this information.
Note also that during restart the log manager makes use of the {\em news}
service (see Chapter~\ref{Ch:news}) to determine which site was the last to fail.
\subsubsection*{Ordering group joins}
All processes should join logged groups in an order consistent with some
fixed total order on the logged groups in the application. This
constraint avoids a possible race condition that can occur when
multiple processes are simultaneously recovering logged groups.
\subsubsection*{Incarnation numbers}
An application should not attempt to maintain incarnation numbers
for logged groups. The log manager will automatically maintain the
incarnation numbers of all logged groups. The log manager starts
groups with incarnation number zero and increments the incarnation
number by one on each subsequent recovery.
\subsubsection*{Cleanup}
A recovering process may see remnants of its previous incarnations. This
may include such things as posted news messages and forked processes.
A recovering process group should be careful to clean up all residual
remnants of its past incarnations that may affect its current correct
functioning.
\subsubsection*{Consistency}
\index{{\em logging}, consistency between logged groups}
The issue of ensuring consistency between the recovered states of
different logged groups is not simple. In the current implementation
of the log manager, no consistency guarantees are made. Any
consistency constraints that must hold between the recovered states
of different groups should be implemented by the application's
programmer. For example, consider a client that issues two updates
in sequence, one to each of two different logged groups. It may
be a consistency constraint in the application that the second update
should occur only if the first occurs. The log manager will not
enforce this constraint on recovery. It is possible for the log manager
to recover the second update when reconstructing one group's state
while failing to recover the first update when reconstructing the
other group's state. To ensure the consistency constraint, the client
could invoke {\tt log\_flush} between the two updates in order to
guarantee that the first update will always be recovered.
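A minimal sketch of this idea follows. The entry names, formats and
variables are invented, and the argument passed to {\tt log\_flush} (the
address of the logged group) is an assumption made for the illustration;
consult the earlier description of {\tt log\_flush} for its exact arguments.
\begin{verbatim}
/* Hypothetical sketch.  UPDATE1/UPDATE2, the formats and variables are
 * invented; the argument to log_flush() (the group address) is an
 * assumption -- see the earlier description of log_flush.
 */
bcast(gaddr1, UPDATE1, "%d", value1, 0);  /* first update                */
log_flush(gaddr1);                        /* force it to stable storage  */
bcast(gaddr2, UPDATE2, "%d", value2, 0);  /* dependent second update     */
\end{verbatim}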
\subsubsection*{Replay with multiple groups}
When more than one logged group is recovered by a process, the replay
phases for all recovered groups will occur concurrently. When this
happens, the log manager guarantees that all replayed messages
are received in the order in which they were originally received.
This guarantee extends to orderings between messages logged for
different groups. This guarantee also extends to orderings between
automatically and manually logged messages. All replayed messages
are delivered in the same order in which they were originally
received or written.
\section{Spooling and Long-haul communication tool}
\index{{\tt spool}}
\index{{\tt spool\_m}}
\index{{\tt spool\_getseqn}}
\index{{\tt spool\_replay}}
\index{{\tt spool\_discard}}
\index{{\tt spool\_and\_discard}}
\index{{\tt spool\_m\_and\_discard}}
\index{{\tt prepend\_and\_discard}}
\index{{\tt spool\_set\_replay\_pointer}}
\index{{\tt spool\_set\_ckpt\_pointer}}
\index{{\tt spool\_play\_through}}
\index{{\tt spool\_cancel}}
\index{{\tt spool\_inquire}}
\index{{\tt spool\_wait}}
\index{{\tt spool\_advise}}
\index{{\tt bin/spooler}}
\index{spooler}
\index{long-haul communication facility}
As discussed above,
an ISIS spool is used for {\em asynchronous} communication with
a process group that is either known to be down, or where
the group may need to spool input for fault-tolerance reasons.
A typical user of the spool is the ISIS long-haul facility,
which is only active when a line to a given remote destination
is up. All communication with this facility is via the spooler,
reducing what would otherwise require a case-by-case interaction
to a single, asynchronous case.
The spooler and long-haul facility were developed by
Messac Makpangou and Ken Birman.
The spooler interface is somewhat restricted by comparison to the remainder of
ISIS, and is intended to be used in a manner
that explicitly recognizes the long delays that may occur
between when these types of messages are sent and when they are received.
These delays make it impractical to support, for example, a ``spooled broadcast''
that would spool a request until the destination service becomes available
and then perform the broadcast and return whatever replies are received.
(The user who wishes to implement the equivalent functionality can do so using
a ``call-back'' approach.)
The basic interface to the spooler is as follows:
\begin{verbatim}
#include "isis.h"
/* Spool a message */
id = spool(sname, entry, fmt, args, SP_KEY, args, ... , 0);
char *sname, *fmt;
int entry;
/* Spooler, message-oriented interface */
id = spool_m(sname, entry, msg, SP_KEY, args, ... , 0);
char *sname;
message *msg;
int entry;
/* get spool sequence number */
spseqn = spool_getseqn(msg)
int spseqn;
message *msg;
/* Trigger spool replay */
spool_replay(sname, SP_PAT, args, ..., 0);
char *sname;
int spool_in_replay;
/* Discard spooled messages */
spool_discard(sname, discard-pattern, 0);
/* Spool (or checkpoint) and discard as one atomic action */
spool_and_discard(sname, spool-request... 0, discard-pattern, 0);
spool_m_and_discard(sname, spool_m-request... 0, discard-pattern, 0);
prepend_and_discard(sname, spool-request... 0, discard-pattern, 0);
char *sname;
/* Set replay and checkpoint pointers for a spool */
spool_set_replay_pointer(sname, spseqn);
spool_set_ckpt_pointer(sname, spseqn);
/* Turn play-through mode on or off */
spool_play_through(sname, SP_OFF/SP_ON);
/* Cancel or inquire about a spooled message; not implemented yet */
spool_cancel(id)
spool_inquire(id)
spool_wait(id)
/* Set spooler options for a spool */
spool_advise(sname, options, 0);
/* Spooler command syntax */
bin/spooler [-v] [-l long_haul_conf] [portno]
\end{verbatim}
When using spools for communication with a remote network, the destination
network name is specified using the spooler option {\tt SP\_NETWORK} followed
by the name (a null-terminated string).
For debugging purposes, the network can be specified as ``local'',
in which case the messages sent will be treated exactly as for long-haul
communication, but
delivered on the same network as that of the sender.
Eventually, our intent is to change ISIS to infer
that remote communication is desired by examination of the group name,
a change that would make the {\tt SP\_NETWORK} argument unnecessary.
The spooler can be contrasted with the ISIS logging facility, which is
concerned with the recovery of individual processes (associated with
specific nodes in the network) into the state that they held at the time of
a failure. Spooling is used when the destination is an entire process group,
and when the group may be offline at the time a message is sent.
By communicating through the intermediary of the spool, the sender
need not be concerned with whether or not the destination group is
operational.
The spooler is thus visible directly to the sender of a message.
Logging is used in a manner transparent to the caller, which would
be coded to deal only with ``operational'' process groups.
The standard use for a spool in ISIS involves a collection of processes
that send messages to a destination process group via the spooler, without
waiting for replies. During periods when the destination group is operational,
these messages are spooled and promptly forwarded
in the order that they were spooled.
During periods when the destination group is down, messages are spooled but
not forwarded.
Upon recovery, a process group initiates replay of spooled messages.
When the replay terminates, newly arriving messages and any messages that had not
previously been fully executed are delivered in spool order.
The spooler has no way to know when execution for a given spooled
message is completed, and this raises the issue of how it can distinguish
between {\em replay} of a message that has already been executed and
{\em first time delivery} of a message that may, in fact, have been
delivered previously but which has not yet been ``executed''.
In practical terms, the spooler provides a variable that
can be tested at runtime, {\em spool\_in\_replay}; it will be
true while replay is occurring.
The method used to distinguish these types of messages in the spool
itself, however, is to
associate a {\em spool pointer} with each spool.
The pointer is under control of
the application program, which must
call spool\_set\_replay\_pointer to assign it a value.
The value supplied in a call to the
{\tt spool\_set\_replay\_pointer()}
routine should be a spooler sequence number, obtained by calling
{\tt spool\_getseqn()}.
It is illegal to set the spool pointer back; it can only be advanced.
Many applications will store checkpoints in spools, but may wish
to retain information that was in the spool prior to making
the checkpoint, as a sort of audit trail.
Accordingly, spools also have a {\em checkpoint pointer}
associated with them.
When asked to replay the contents of the spool, the spooler actually
examines only those messages between the
checkpoint pointer and the spool pointer, inclusive.
In contrast to the spool pointer, which must always be
set explicitly, the spooler will automatically
set the checkpoint pointer when the spool\_and\_discard
or prepend\_and\_discard operation is invoked, leaving it
pointing to the checkpoint message.
The checkpoint pointer can be explicitly modified
by calling spool\_set\_ckpt\_pointer giving the sequence
number as an argument.
A spool thus has the following structure:
\begin{verbatim}
(start; small seqn numbers)
old messages
checkpoint
portion that can be replayed
new messages
(end; large seqn numbers)
\end{verbatim}
The checkpoint pointer has as its value the sequence number of the current
checkpoint. The replay pointer can have any value; it points to the last
message in the portion of the spool that has already been processed.
The ``old'' messages will only be replayed if the
checkpoint pointer is moved first.
The replayed portion of the spool consists of the checkpoint itself
plus a series of
``incremental updates'' to it.
Any spooled messages subsequent to the spool pointer
may be new -- or may be
ones that were seen before, but where there was a crash
before the spool pointer was moved forward.
The spooler interface is as follows:
{\tt spool}
puts a message in the {\em spool} for a named process group. Normally, this
group would be one that is believed to not be operational.
The
{\tt spool\_m}
variants allow the message to be spooled to be precomputed and are analogous to
calling {\tt bcast\_l} and specifying the `m' option.
On recovery, a group triggers spool replay either by invoking
{\tt spool\_replay}
or by specifying the
{\tt PG\_REPLAYSPOOL}
argument to {\tt pg\_join}.
Notice that spool replay is not automatic in ISIS; it must
be activated explicitly.
During replay, the flag {\tt spool\_in\_replay} will be non-zero.
Only messages with spooler sequence numbers greater than or
equal to the checkpoint pointer and smaller than or equal to the current
spool pointer will be replayed.
Moreover, replay allows messages to be replayed selectively, using a replay pattern.
For example, say that an application spools all types of messages, but that
only some messages are needed to recover after a failure.
A replay pattern can be specified that will suppress replay of the ``irrelevant''
messages. On the other hand, their presence in the spool may be useful in other
ways, for example to exactly recreate a scenario that has been causing a process to crash.
After replay has finished, any additional spooled messages in the spool
or any new messages that are received by the spooler are ``played through''
immediately upon
reception, and this continues
so long as the process group remains operational.
Play through can be disabled by calling
{\tt spool\_play\_through(),}
but is activated by default.
Unlike messages being replayed, play-through
messages are NOT subject to any sort of pattern-matching process.
When
{\tt spool\_play\_through()}
is used to disable play-through, the procedure must
be called {\em before} calling
{\tt spool\_replay()}
(or {\em pg\_join}).
Otherwise, some play-through may occur during the interval after the replay
completes and before your program is informed of it.
Play-through messages are not delivered until after
{\tt isis\_start\_done()}
is
called in cases where replay is initiated during startup.
Programs must explicitly discard the contents of a spool.
This is done using
{\tt spool\_discard}.
Finally, the procedure
{\tt spool\_and\_discard}
atomically discards some of the messages in a spool and appends a new
message (normally a checkpoint) to the end of the spool.
(A variant form, prepend\_and\_discard, is also available but is rarely used;
this places the checkpoint at the start of the spool).
The checkpoint pointer is simultaneously updated to
point to this new checkpoint.
The following additional spooler functions are not yet implemented.
{\tt spool\_cancel(id)}
provides a way to cancel a pending request.
{\tt spool\_wait(id)}
blocks until a specified request has been replayed.
{\tt spool\_inquire(id)}
returns 0 if the request is still spooled and 1 if it has been replayed.
{\tt spool\_advise(sname, options, 0)}
provides an interface with which the caller can create spools
having special characteristics (non-standard resilience, size limits, etc).
Currently, all spools have the same degree of resiliency to failures and
no size limit is enforced.
\subsection{Installation hints}
The spooler program itself is normally run as part of the isis.rc startup
sequence.
It is usual to run the spooler only on machines with local disks, and to
run 3 to 5 copies of the spooler for an entire LAN.
The spooler program has three optional arguments, which can be given in
any order.
(In future releases of the spooler, it may make sense to run
more than 5 copies of the spooler if possible.)
{\tt -v}
places the program in {\em verbose} mode; it prints a description of essentially
every operation it performs on the console.
This is intended for debugging only.
{\tt -l long\_haul\_conf}
is used to specify a long-haul configuration file, as discussed below.
{\tt portno}
is a port-number for connecting to ISIS, and need only be specified if you wish to
override the default value found in /etc/services or /yp/etc/services.
{\tt spool} puts a message in the {\em spool} for a named process
group and delivers it promptly (``plays it through'') if the process
group is operational. The {\tt sname} argument is the name under
which the group will run when it restarts. The {\tt entry } argument
tells what entry point this message should be delivered to upon
replay. The {\tt fmt} is a format from which the message should be
created; the arguments are as for {\em msg\_put}.
A zero-terminated series of optional keywords describing this message
follows. Each keyword in the series consists of a name---we define a
basic set, but you can extend it---and perhaps arguments associated
with that name. There are currently three sorts of keywords: numeric
ones, which have an integer value; timer keywords, which take a long
integer argument of the sort returned in the {\em seconds} (tv\_sec)
field of the timeval structure by gettimeofday(2); and SP\_KEYWORDS
which takes a null-terminated list of strings as its argument.
The type of broadcast used for actually transmitting to the group will
normally be {\em cbcast}. This is certain to work correctly if all messages to the group
are sent via the spooler. However, if a group receives some of its messages directly,
you may need to specify the broadcast
type. This is done by including the key SP\_FBCAST, SP\_CBCAST, SP\_ABCAST or SP\_GBCAST,
with no argument.
The spooler currently does not predefine any numeric message keys.
Instead, the user is permitted to define up to 9 such keys.
This should be done using {\tt \#define} and specifying values in the range 1-9
inclusive.
A numeric key should be immediately followed by its value in the call to spool.
There is currently only one timer key that the user would explicitly specify
in a call to spool: SP\_EXPIRES. The argument to SP\_EXPIRES is an absolute time
at which this message ``expires''.
The argument should be computed by calling gettimeofday(\&now) and then
computing {\tt now.tv\_sec+delay}, where delay is a delay in seconds
between the time of the call and the time when the message expires.
An expired message will never be delivered to a client, but neither will it
actually be deleted from the spool
until the next time that {\tt spool\_discard} is called.
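For instance, a message that should expire one hour after being spooled
could compute its expiration time as sketched below; the variable names are
illustrative only.
\begin{verbatim}
#include <sys/time.h>

struct timeval now;
long expires;

gettimeofday(&now, (struct timezone *)0);
expires = now.tv_sec + 60*60;       /* absolute time, one hour from now  */
/* ... later, pass "SP_EXPIRES, expires" among the keys given to spool() */
\end{verbatim}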
A spooled message can also have a list of ASCII strings associated with it.
Such a list, null-terminated, should follow the keyword SP\_KEYWORDS.
The following illustrates a very complex call to the spool routine as it might
be done from C; the corresponding interface is also supported from FORTRAN and LISP.
\begin{verbatim}
#define SP_ID 1
#define SP_EPOCH 2
....
sid = spool("dbserver", ADD_RECORD, "%s,%d", "Richard Nixon", 68,
            SP_ID, db_idno++,
            SP_EPOCH, current_epoch,
            SP_EXPIRES, now.tv_sec+60*60*12,
            SP_KEYWORDS, "add", 0,
            0);
\end{verbatim}
The above example uses an ``id'' number and an ``epoch'' number, but
the reader should be aware that these have no special meaning to the
spooler. On the other hand, the spooler {\em does} assign all spooled
messages a sequence number on a per-spool basis, which is incremented
for each received message. The spooler delivers messages sequentially
in order of increasing sequence number, except during replay when
messages from the start of the spool up to and including the current
spool pointer are subject to a pattern and will not be replayed unless
the pattern matches.
The spool request returns the spooler sequence number that was
assigned to the message. Given a message that was sent via the
spooler, its sequence number can be obtained directly by calling {\tt
spool\_getseqn(mp).} This function returns 0 when applied to a message
that was never spooled. The special keyword SP\_SPSEQN can be used to
specify limits on the sequence number as part of a replay or discard
pattern. For any given spool, sequence numbers are strictly
increasing.
{\tt spool\_replay}
triggers replay of a spool.
Replay can be selective; for example, one can replay just the
messages from a particular sender or just the messages with spooler sequence
numbers larger than a specified value.
A pattern is specified very much as the set of keys for a message,
but where a key typically specifies a value, a replay pattern
typically specifies a rule that the value must satisfy for the message
to be replayed. If several replay constraints (patterns) are given,
all must be satisfied for a given message to be replayed.
In the case of a numeric key, a low and a high bound are given (either
can be SP\_INFINITY, however). Only messages that include the
designated key with a value greater than or equal to the low bound
and less than or equal to the high bound will be replayed. For example, {\tt
spool\_replay(sname, SP\_ID, 55, SP\_INFINITY, 0)} replays all
messages in the spool {\em sname} that carry the user-defined numeric key
SP\_ID with a value of 55 or greater.
As noted above, the spooler's internal sequence number can be treated
as a numeric pattern using the predefined keyword SP\_SPSEQN. Note,
however, that replay will only be applied to messages between the
checkpoint pointer of the spool and the current spool pointer.
The time at which a message was spooled can be used as part of a
pattern. SP\_ATIME places bounds on this time in absolute time units.
SP\_RTIME places bounds on this time relative to the time at which
spool\_replay was called.
The process that sent or spooled a message can also be part of the
pattern. SP\_SENDER takes a single address which is the address of
the sender whose messages are to be replayed. SP\_SPOOLER works the
same way, but takes the address of the process that invoked spool.
Note that unless {\tt spool\_m} is being used, the sender is by
definition the same process as the spooler. In the case of spool\_m,
however, the message could be one that was received from some other
source.
If string keywords were specified, the pattern SP\_KEYWORDS can be used to
enforce a 1-1 exact match. The number of strings and their values must
match for the message to be replayed.
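For example, to replay only the messages that were spooled with the single
keyword ``add'' (as in the spool call shown earlier), one might issue the
call sketched below; the inner 0 terminates the keyword list and the final
0 terminates the pattern.
\begin{verbatim}
/* Replay only messages whose keyword list is exactly ("add"). */
spool_replay("dbserver", SP_KEYWORDS, "add", 0, 0);
\end{verbatim}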
To replay all the messages in a spool, one would call {\tt spool\_replay}
with an empty pattern: {\tt spool\_replay(sname, 0)}.
After a spool\_replay is done, the spooler plays through any
messages that are received and that match the ``current'' replay
pattern, with the single exception of any message received from a
spool\_and\_discard request (in this case, the spooled message normally is a checkpoint,
and hence playing it through would cause confusion).
It will also spool these messages upon reception.
This play-through behavior continues
so long as the destination process group remains accessible, or
until spool\_play\_through is called to inhibit further playthrough.
{\tt spool\_discard}
is called just like
{\tt spool\_replay.}
It deletes any spooled messages between the start of
the spool and the current spool replay pointer
that {\em match} the specified pattern,
retaining in the spool any messages that {\em do not} match the pattern.
If an empty pattern is specified, all messages will ``match''
and be discarded.
{\em NOTE:} if the spool pointer has not been
set and a call to
{\tt spool\_discard}
is issued,
the discard
operation will retain {\em all} messages that were
in the spool.
This is because the discard pattern is applied only to
messages in that portion of the spool up through (and including)
the message with sequence number equal to the spool replay pointer value.
{\tt spool\_and\_discard}
combines a call to
{\tt spool}
and a call to
{\tt spool\_discard}
into one atomic operation.
In the arguments associated with the message to be spooled one may specify
SP\_PREPEND, in which case the new message will be stored at the front of the spool.
Otherwise, the new message is appended to the end of the spool, which is normally the appropriate
place to store a checkpoint.
The checkpoint pointer is changed to point to the new message.
For example, say that one wishes to make a checkpoint, and that the
application has just received a spooled message
for which spool\_getseqn(mp) returned 66.
A good way to do this would be to issue the sequence of calls:
\begin{verbatim}
spool_set_replay_pointer(spname, 66);
spool_m_and_discard(spname, chpt-msg, 0);
\end{verbatim}
The first of these tells the spooler that messages up through 66 have
already been ``consumed''. The second call atomically modifies the
spool by appending the checkpoint message to it and deleting all
messages up to this point---{\em all} because no pattern was
specified, and an empty pattern matches all messages. Any other
messages in the spool with sequence numbers greater than 66 are
retained.
\subsection{Long-haul communication}
ISIS spools are also used for communication with remote networks.
A network is a set of ISIS sites isolated from the rest of the ISIS world.
The only way for remote networks to communicate is through the
long-haul communication package. Each network has a name.
A network's name looks like a group's name and is subject to the same size
limit as a group name.
We begin by focusing on the basic mechanism, but later give a brief
description of the long-haul file transfer facility and the long-haul
broadcast protocol suite.
The static description of networks is kept in a file.
This file contains the default TCP port number (the first item
in the file), followed by
a set of entries, each describing one specific network.
Each entry is composed of
the network's name followed by a null-terminated
list of host descriptors.
Each host is specified either by its internet host name (in which case
this name is prefixed by ``N:''), or by its internet address (in which case
this address is prefixed by ``A:'').
A host descriptor also contains a TCP port number; the host's name (or address)
is separated from the port number by the slash (`/') character.
If the port number is zero, the long-haul package uses the
default value.
The following illustrates a long-haul network configuration file:
\begin{verbatim}
2200
norway N:thor.cs.cornell.edu/0 N:hymir.cs.cornell.edu/0 0
sweden N:sif.cs.cornell.edu/1800 N:sigyn.cs.cornell.edu/1800 0
usa N:utgard.cs.cornell.edu/0 A:128.64.27.3/0 0
\end{verbatim}
This defines three ISIS networks, one in Sweden, one in Norway, and one in
the United States.
Each network has a set of designated ``contact points'' -- normally, a few of the
file servers, because one normally runs copies of the spooler only on file
servers (anything else wouldn't make sense).
For example, regardless of how many machines reside in the Norway LAN, only thor and
hymir are declared as running the spooler and will be used for connections
from Sweden and the USA.
The port number used for connections will be 2200 except when communicating with
Sweden.
One host in the United States is declared using its
internet host address.
Each list of contact machines is zero terminated.
The long-haul package
establishes connections between the local and remote networks.
For each remote network described in the networks configuration file,
one of the running hosts is designated as the manager of the connection
with this partner.
Each designated manager tries to connect to one of the remote
network's hosts, trying different hosts in succession
until one accepts the connection request.
Each long-haul process may be in charge of more than one long-haul connection.
The long-haul package ensures automatic reconnection in the case
of failure of an existing connection.
It also preserves the state of a failed connection and makes it available
to the new manager.
This provides {\em at-most-once} delivery semantics
in the presence of connection failures and host crashes.
To enable long-haul communication, you must first inform the spooler
of this by starting the spooler command with the option
{\tt -l yourNetFileName.}
For example, if the file ``/usr/spool/isis/long\_haul.rc'' contains the configuration file
shown above, your isis.rc file might contain a line: \break
{\tt bin/spooler -l /usr/spool/isis/long\_haul.rc}
You trigger long-haul communication by specifying the remote network name using
the {\tt SP\_NETWORK} option when you call
the {\tt spool} or {\tt spool\_m} procedures.
For example: \break
{\tt spool\_m("rule-checker", QUICKCHECK, msg-to-spool, {\tt SP\_NETWORK}, "sweden", 0);} \break
{\tt spool\_m("rule-checker", QUICKCHECK, msg-to-spool, {\tt SP\_NETWORK}, "local", 0);}
\subsection{File transfer}
This interface is new and is described in the on-line manual page
lh\_file\_transfer(3).
The facility provides an efficient way to transfer a file over
a long-haul link; it achieves the full bandwidth possible over such
a link with a simple program-level interface.
File naming is automatically handled by the tool.
\subsection{Long-haul multicast}
This interface is experimental, and implements versions of {\tt cbcast}
and {\tt abcast} for use over long-haul links. The manual page
lh\_bcast(3) gives details.
\subsection{Constants}
The number of networks you can define is limited to MAXNETS,
and the number of sites in a network is limited to MAXSITES. Both of
these constants are defined in util/long\_haul.c.
\subsection{Warnings}
The long-haul package uses tcp ports to realize inter-LAN communication.
One consequence of this is that, when a long-haul process fails,
a delay greater than 30 seconds must elapse before attempting to
restart the
process.
Otherwise, the tcp port will be unavailable (due to a tcp protocol
feature), an error message
will be logged on the system console where ISIS was started up,
and the network will be unreachable from the rest of the system.
This initial version of the long-haul
facility does not yet implement recovery from a total network failure.
If all
spoolers on one network fail,
the long-haul communication state and any spooled messages
are lost. This deficiency will be corrected in future versions of the
facility.
This chapter describes the major elements of the
ISIS system in more detail.
ISIS is structured into three major layers.
\begin{itemize}
\item
The {\tt process group} layer of the system implements virtually
synchronous process groups. This layer manages the list of
operational sites, the membership of process groups, and
the process-group name space.
\item
The {\em toolkit} layer
is accessed directly from within application programs via a
subroutine call interface.\footnote{
Although designed for use from the C programming language, ISIS has
evolved as the demands on the system have changed. At present,
ISIS can be used from C++, several versions of LISP, Fortran, and in
some situations, Prolog. The appearance of an ISIS application
program thus varies substantially depending on the purpose of the system.
However, the basic Toolkit abstractions are uniform across all
languages and interfaces. }
Some operations are performed locally to
the client program that issues such a call, while others require
actions by the process group layer or that messages be multicast to
other process group members. Moreover, some ISIS tools are implemented
using utility programs that must be operational if the tool is to be
employed. By and large, the user has the option of starting these sorts
of utilities, or excluding them, on a per-site basis.
In this manner, the amount of code running on a machine can be trimmed
to reflect the needs of a particular application.
\item
Whereas the term ``utility'' connotes a small program or application
with fairly narrow scope, ISIS also includes some major applications
that represent complex subsystems in and of themselves.
Examples include the DECEIT file system, the META realtime sensor
facilities, the DAM distributed application manager, and the NEWS
subsystem. These ISIS tools can often be accessed directly from a
terminal, rather than at the subroutine library level,
and represent a powerful complement to coding each ISIS application
directly in a lower-level programming language.
\end{itemize}
Before considering these topics in greater detail,
it will be useful to say something about how ISIS has been
implemented---its ``runtime architecture''.
This discussion is not intended for use in actually setting up ISIS.
Rather, this tour of the system is provided to help potential users
familiarize themselves with the programs of which ISIS is composed.
Instructions on setting ISIS up and administering it appear in Appendix A.
\section{ISIS runtime architecture}
The ISIS system consists of a collection of programs that are started
on each machine where ISIS facilities will be accessed directly.
Only two of the programs are always needed: {\tt isis} and {\tt protos}.
Moreover, the names of these two programs are somewhat misleading.
Below, we write ISIS (in caps) to refer to the entire set of programs
making up ISIS on a particular machine; {\tt isis} (in {\tt typewriter} font)
denotes the specific program called {\tt isis}.
When we talk about a remote client connecting to ISIS, this implies that
the remote machine supports the ISIS client library but connects to ISIS
on some sort of a {\em mother} computer. Thus, the collection of
ISIS programs might not be running on every machine in the network, but
it definitely needs to run on a subset of them.
The system calls used to connect to ISIS are {\tt isis\_init} if the
caller is local to ISIS, and {\tt isis\_remote} if the caller is
remote (actually, a local caller can also use {\tt isis\_remote}, but this
results in slightly slower response times on some operations).
A call to {\tt isis\_probe(freq,wait)} is used to tell ISIS to begin watching the
client.
ISIS will probe it once every {\tt freq} seconds and
kill the process if no response is received after {\tt wait} seconds.
By default, ISIS will not probe local clients and will probe
remote clients every 60 seconds, killing them if there is no response
within a further 60 seconds.\index{\tt isis\_probe}
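The following fragment sketches how a local client might connect and ask to
be monitored. The use of {\tt isis\_init(0)} to request the default port
number is an assumption based on the usual toolkit conventions rather than
something defined here; the probe intervals are arbitrary.
\begin{verbatim}
#include "isis.h"

main()
{
    isis_init(0);          /* connect to local ISIS; 0 = default port    */
                           /* (assumed convention)                       */
    isis_probe(30, 60);    /* probe every 30s; kill after 60s of silence */
    /* ... declare entry points, then enter the ISIS main loop ...       */
}
\end{verbatim}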
\begin{itemize}
\item
{\tt isis} is a short program that starts up the
ISIS system at a site and monitors its health, and subsequently handles
calls to {\tt isis\_remote} and
{\tt isis\_probe}.
This can seem confusing, since the name of the program suggests
that it is somehow the ``core'' of the whole system, whereas it
is actually only a little more than a compiled shell script that forwards
requests on behalf of remote clients if {\tt isis\_remote} is in use.
If there is a ``core'' to ISIS it is the {\tt protos} process.
To run ISIS on a machine, one executes {\tt isis} and supplies it with
two arguments: a {\tt sites} file that lists the machines in the
network with which it will be communicating, and an {\tt isis.rc}
file that lists the utilities needed on this machine.
Often, using NFS, {\tt isis} will be started out of a directory shared by
many machines. In these cases, {\tt isis} is also given an
argument specifying a private place where files needed by the ISIS
utilities can be stored.
The basic start sequence is as follows. {\tt isis} first tries to
find out if ISIS is running on any of the machines listed in the sites
file. It does this by sequencing through the list sending a message
to the {\tt isis} process at each listed site.
The UDP port number it uses for this is specified in the {\tt sites} file;
it is the last one in the list of 4-digit numbers after the site-id number.
Since UDP messages are unreliable, there is always a small chance that
a message will get lost and prevent {\tt isis} from contacting one of
the listed sites. Consequently, it runs this contact attempt twice.
Suppose that this is the first machine to be booted and that no other ISIS
sites are up.
After two attempts to contact other sites, {\tt isis} will not have
received any replies. It now concludes that a {\em total restart}
is underway.
To initiate such a restart,
{\tt isis} first runs the {\tt protos} process, copying any arguments
given in {\tt isis.rc} and sometimes adding extra arguments of its own.
Telling {\tt protos} to do a total restart signifies that
this copy of ISIS is the only one running in the network, and
{\tt protos} computes a new site-view listing only this site.
{\tt isis} prints this out on the console.
In addition, {\tt isis} starts up those utility programs
needed at the site, if any, following commands in {\tt isis.rc}.
Finally, {\tt isis} sits and waits for other ISIS startup attempts,
monitoring for termination of the utility programs at its site, and
watching for {\tt isis\_remote} requests.
During this period, it will consume very little CPU time and may even
be swapped out by UNIX.
Its main activities are to forward system calls from remote clients to {\tt protos}
and to forward replies back to the remote clients.
Alternatively, perhaps ISIS was already running on machines {\tt rati} and
{\tt moose} when {\tt isis} is started at machine {\tt sigyn}.
During the initial start sequence, {\tt isis} on {\tt rati} and/or
{\tt moose} will have received an inquiry message from {\tt isis}
on {\tt sigyn}, sent as a UDP packet.
They respond by sending back a list of currently active ISIS sites.
In this case, {\tt isis} on {\tt sigyn} starts {\tt protos} in what
we call the {\em partial restart mode}, meaning that the new site
should be added to a pre-existing list of sites.
This requires that {\tt protos} register itself with the active {\tt isis}
sites and may take a little longer than
creating a completely new site list ({\em site-view}).
Again, {\tt isis} prints the site-view when the restart is completed and
becomes idle.
After restart is completed, {\tt isis} has two roles. First, it
responds to restart queries for the remainder of the time that the system
is running.
This may not be an ideal architectural decision, since it slows
restarts down ({\tt isis} may be swapped out when the message arrives); hence
we are considering moving this function into {\tt protos}.
Additionally, if one of the processes that {\tt isis} started in the
{\tt isis.rc} sequence terminates (exits), {\tt isis} prints a message
to this effect. If that process was {\tt protos}, {\tt isis} will
either restart {\tt protos} or shut everything down, depending on
the command-line options used when the system was started.
Similarly, if {\tt isis} is killed, {\tt protos} will detect this
and exit; any other programs that were using ISIS typically shut down
at this point too.
\item
{\tt protos} implements the process group and virtual synchrony mechanisms
of the overall system.
It does this through a communication protocol involving the other {\tt protos}
processes in its {\em cluster}, and to a lesser extent by communicating
with {\tt protos} in other clusters.
Here, the term {\em cluster} refers to a set of as many as 128 fairly
closely coupled ISIS sites.
While ISIS can run in a local area setting with many hundreds of
sites, this is only possible if the sites are subdivided into some
number of smaller clusters. Ideally, cluster sizes should be limited to
about 32 sites.
\end{itemize}
The remaining processes that {\tt isis} starts up are associated with
specific utilities that you may choose to include or exclude on a per-site
basis (note, however, that some utilities depend on others).
\begin{itemize}
\item
{\tt rexec} is the ISIS utility for remotely executing a program.
Thus, if an application on {\tt sigyn} needs to start a program on
{\tt rati}, it would normally do this through the intermediary of {\tt rexec}
on that machine.
\item
{\tt news}
is a utility used by some ISIS programs to persistently store small amounts
of information. It is a program-level analog of the network news
facility, and represents a convenient but short-term memory mechanism.
\item
{\tt lmgr}
is a utility used to maintain logs on behalf of {\em logged processes}.
Such processes can recover their state after a crash/reboot sequence.
Each time ISIS is restarted at a site, this program cleans up logs and exits.
When you see a message from {\tt isis} about its termination
you should not be concerned about this.
You need not use the {\tt lmgr} program unless some applications need
this ability.
Note that {\tt lmgr} is the only ISIS utility that terminates shortly after
startup.
The {\tt lmgr} utility uses the {\tt news} utility.
\item
{\tt rmgr}
is the ISIS {\em recovery manager}.
This program is used to automatically restart user applications that
should ``always'' run on a given site.
It needs the {\tt lmgr} utility.
You need not use the {\tt rmgr} program unless some applications need
this ability.
\item
{\tt xmgr}
is the ISIS {\em transaction manager}.
This program is used to implement database-style transactions using ISIS,
and makes use of {\tt lmgr}.
You need not use the {\tt xmgr} program unless you develop transactional
applications or need to run the bank demo.
\item
{\tt spooler}
is the program that implements persistent spooling facilities within ISIS.
This program also handles long-haul links between ISIS systems on
different local area networks.
The spooler makes no use of other facilities and is not run on every ISIS system.
In fact, one normally runs the spooler only on a machine with
a local disk, and rarely runs more than 3 or 4 spoolers (in total) on
any given local area network.
\item
{\tt meta}
is a component of the {\em META} system, which provides realtime
sensors and actuators and is quite elaborate.
{\tt meta} uses {\tt lmgr} to store persistent state.
\end{itemize}
An ISIS site could include all of these
utilities, which can be listed in {\tt isis.rc} in the order given.
However, it would be more typical to omit one or more of them.
For example, if you never use {\tt rexec} you might prefer not to run it,
since it gives users a way to run programs on a destination machine
(subject to permissions, of course).
Running the spooler on machines without local disks would be silly,
since the spooler is only useful if its spools are accessible,
and there is no advantage to having multiple spoolers associated with
any single physical file system.
This also argues for site-specific {\tt isis.rc} files.
On the other hand, by using a trimmed-down {\tt isis.rc} file,
you may run into problems down the road if you install some
ISIS application that needs a utility you decided to omit.
These are decisions that should be made on a per-installation basis.
There are additional ISIS programs in the ISIS binary directory, but
these are mostly demos concerned with illustrating particular aspects
of the system. We describe these in Chapter \ref{Ch:demos}.
The remaining architecturally important component of the system is
the Toolkit library.
The Toolkit is composed of subroutines that
are linked directly into application programs. These are split into
three groups: the message utilities ({\tt mlib/libisism.a}), the
optional client utilities ({\tt clib/libisis1.a}), and the mandatory
client utilities ({\tt clib/libisis2.a}).
These routines typically cache some information, but
perform other operations by remote procedure calls to {\tt protos}.
Thus, most of the toolkit is just an interface to facilities actually
implemented by {\tt protos}.
One exception to this is that client programs sometimes send
messages directly to one another when using the {\em bypass multicast
protocols} for communication.
This idea is explained in more detail below.
Two examples will illustrate the overall architecture. Say that
your program wishes to obtain the {\tt address} of process group
{\tt "/nrmd1/sigpro"}. As we saw in Chapter 1,
this is done by a subroutine call to {\tt pg\_lookup}
(defined in {\tt clib/libisis2.a}). {\tt pg\_lookup}
keeps a local cache and may be able to respond to your query without
a call to protos; otherwise, it does an RPC and blocks waiting for
{\tt protos} to send back the answer.
Protos may know the answer locally; if so, it responds
immediately. If not, a multicast is used to track the group down.
The rule is that the mapping of a process group name to
its logical address is always known at the sites
where the processes in the group reside, and can be cached elsewhere for
speed.
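In code, the lookup in this example is just the single call sketched below;
the null-pointer check on failure is an assumption about typical usage
rather than part of the interface.
\begin{verbatim}
address *gaddr;

gaddr = pg_lookup("/nrmd1/sigpro");      /* cached locally when possible */
if (gaddr == (address *) 0)              /* assumption: null on failure  */
    exit(1);
\end{verbatim}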
Now, say that your application wishes to save a {\tt news}
item. It does this by sending a {\tt bcast} with one
destination and waiting for one reply.
The {\tt bcast} subroutine converts the request into a message,
which it forwards into {\tt protos} using an RPC. {\tt protos}
delivers the copy to {\tt news}, which responds with a message of its own.
This message is sent back to
{\tt protos}, which forwards it back to your application.
On arrival, the toolkit library matches the reply with the call
task and the {\tt bcast} subroutine that sent the news item returns
to its caller.
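Stripped of this internal routing, the application's side of the exchange is
just a one-destination broadcast that waits for a single reply, roughly as
sketched below; the entry name {\tt NEWS\_POST}, the formats, and the
variables are invented for the illustration.
\begin{verbatim}
/* Hypothetical: send one item and wait for exactly one integer reply.   */
address *news_addr;     /* obtained, e.g., from pg_lookup()              */
char    *item_text;     /* the item to post (invented for this sketch)   */
int      status, nr;

nr = bcast(news_addr, NEWS_POST, "%s", item_text,
           1, "%d", &status);   /* 1 = number of replies wanted          */
\end{verbatim}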
All of these indirections and RPC's to {\tt protos} have a high
cost. On
many systems, basic local RPC times are as high as 8ms, and even fast
workstations reduce this to 1 or 2ms at best.
Thus, one would gain a substantial speedup by communicating {\em directly}
with a destination process, rather than {\em indirectly} via {\tt protos}.
The trick is to do this without losing the virtual synchrony and
fault-tolerance properties that make ISIS useful.
When {\em bypass} communication is used, an ISIS application does just this.
Messages sent by the program are transmitted directly to destination
programs without passing via
{\tt protos}. However,
bypass communication can only be used under certain restrictions.
In the next section, we discuss these in enough detail to enable you
to design applications that will definitely exploit this efficient form of
communication.
\section{Process group and communication level}
The preceding section should have left you with a good sense of the
overall ISIS architecture.
We now review the layers of the ISIS system itself.
As noted earlier, the logically lowest level of the system is concerned
with maintaining process groups and implementing group communication.
As a user, the main issues that arise in this level relate to performance.
A process group is in some ways analogous to an open communication channel.
Creating a group, and changing its membership, are comparatively costly
operations, much as opening a communication channel is costly.
On the other hand, communicating with a group is potentially very cheap.
In most applications, process groups change membership infrequently,
but communication is comparatively frequent. It follows that
the {\em relative}
overhead of creating and managing a group membership list
is normally low. For this reason, we generally urge new ISIS users to
think of groups as an inexpensive facility that can be used casually
in an application.
On the other hand, one can certainly imagine settings where this
cost would become an issue.
As a user grows sophisticated, the question of performance becomes
a more critical issue. ISIS offers several ways that
sophisticated users can avoid potential
performance bottlenecks, such as the one that would arise
in an application that might need
very large groups or need to change group membership very frequently.
Similarly, the ISIS communication primitives vary greatly in their cost.
To a first approximation, we tend to urge users to multicast to groups,
on the assumption that most applications will multicast infrequently
with respect to the amount of other work done,
and hence will perform quite well regardless of the speed of the ISIS
multicast primitive used.
However, in a sophisticated application, the system
architect will certainly want to understand what structures
yield especially efficient communication paths. Again, ISIS offers
a range of mechanisms for the sophisticate, while providing simple
defaults that work well in situations where performance is not a dominant
consideration.
\subsection{Group membership changes}
The operations by which a process group is created or joined, and the
operation that looks up the ISIS address of a process group by its
symbolic name, are both fairly expensive ones, with costs ranging from
perhaps 20ms to more than 400ms depending on the situation (on a SUN 3/60) and
the number of processes in the group.
These comparatively
heavyweight mechanisms can be used in most applications, but should be
avoided in the following specific situations:
\begin{itemize}
\item
{\em The membership of a new group will be a subset of the members of
an existing group.}
In this case, it would be extremely costly to send a message to all
the processes involved and have them
invoke the {\tt pg\_join} routine concurrently to join the new group.
Instead, the {\tt pg\_subgroup} facility should be used; this allows a
single member to create the new group in one operation.
\item
{\em The membership of a group changes extremely rapidly.} For example, as
the META system detects events, the set of sensors it is monitoring
changes rapidly. A consequence is that if META used a single process group
to represent such a set, it might change the group membership once for
every multicast done.
In such situations, one should use a {\em process list} to maintain
information about the processes to which information is to be sent.
Process lists have some of the properties of a process group, but can
only be used in a restricted manner.
For those operations supported (membership change and multicast) they
are extremely inexpensive. However, they lack many of the features of
a full process group. For example, one cannot associate a symbolic name
with a process list, look it up from some other process, or monitor its
membership. Somewhat like a subgroup, a process list is associated with
a {\em parent process group}. Only members of the parent group can belong
to the list.
\item
{\em A process group might get very large.}
The protocols used to maintain ISIS process groups slow down as the
number of members grows. Consequently, a group with more than 20 or 30
members is about as large as can be efficiently supported.
This forces you to design applications in ways that keep
process group membership small in most situations.
If a large group is needed, it should normally be fragmented into
smaller subgroups that do the actual work.
At present (in ISIS V2.0), this is difficult to do; normally, one does build
the large group but then avoids changing its membership, using subgroups
where possible. Or, one might avoid creating any overall group at all,
and instead maintain only smaller subgroups. However, fragmentation of
a big group into smaller ones should soon become cheaper and easier, because
mechanisms to facilitate this
are being added to ISIS now.
There is one exception where large groups are potentially beneficial; it
arises if a group is used to communicate information on a frequent basis
to a set of processes, all of which need this information. For example,
a group might be the set of processes monitoring quotes on IBM stock as
they are read off a ticker feed. In such a situation, with a small
number of multicast sources (perhaps one) and a large number of destinations,
large process groups will often give the best possible performance.
However, few applications use this sort of group widely, and this
approach can only be used if the group membership changes very rarely.
In ISIS V2.1, the hierarchical grouping facilities will include solutions
to this problem that have slightly slower multicast
performance, but also have much reduced overhead when membership changes.
This will enable the user to optimize performance based on the
relative frequency of these two types of operations.
\end{itemize}
\subsection{Communication primitives}
In the first chapter we saw how the ISIS {\tt bcast} primitive is
used to communicate with a process group. Performance of this interface
is clearly critical in communication-intensive applications.
Thus, it is important to realize that
{\tt bcast} performance varies a great deal, depending on the
context in which the multicast is done.
The {\tt bcast} primitive is in fact a shorthand for a protocol we call
{\tt abcast}, or {\em atomic broadcast}.
In addition to {\tt abcast} ISIS also provides the
multicast primitives {\tt fbcast, cbcast,} and {\tt gbcast}.
These four primitives differ in the type of multicast ordering
they provide, and in their cost: {\tt fbcast} and {\tt cbcast}
are cheapest (in most cases these have identical costs, although
there are some specific situations where {\tt fbcast} is faster),
{\tt abcast} is about twice as costly as these, and {\tt gbcast}
about three times as costly as {\tt fbcast} or {\tt cbcast}.
We will not discuss the differences between these protocols for the
time being; Chapter \ref{Ch:order} covers this subject.
Because they are not identical, it is not always safe to simply
substitute one for another, and we have developed a fairly elaborate theory
for deciding when
a multicast can be done using {\tt fbcast} or {\tt cbcast}
rather than {\tt abcast}.
(We use {\tt abcast} as the default because it is
almost always a ``safe'' choice, i.e. one that yields virtually synchronous
executions.)
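To make this concrete, here is a minimal sketch of what such a substitution
looks like in application code. The group name, entry number, and data are
purely illustrative, and the argument layout shown (destination address,
entry number, outgoing format and arguments, number of replies wanted,
reply format and reply buffer) is assumed to follow the {\tt bcast} style
of the Chapter 1 examples; only the name of the primitive changes.
\begin{verbatim}
/* Illustrative sketch only: "quotes" is a hypothetical group name   */
/* and QUOTE a hypothetical entry number.                             */
address *gaddr;
int     price = 91, answer, nresp;

gaddr = pg_lookup("quotes");

/* Default, strongly ordered delivery: abcast */
nresp = abcast(gaddr, QUOTE, "%d", price, 1, "%d", &answer);

/* The cheaper primitive, if the application can tolerate it; the    */
/* rest of the call is unchanged.                                     */
nresp = cbcast(gaddr, QUOTE, "%d", price, 1, "%d", &answer);
\end{verbatim}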
In the early versions of ISIS, the difference in performance between
the cheapest primitives ({\tt fbcast} or {\tt cbcast}) and the more
costly ones (including {\tt abcast}) was large, and it was crucial
to understand the different characteristics of these facilities to obtain
good overall performance. This theory was thus quite important to our
work and got a lot of visibility in our papers on ISIS.
In the more recent versions of
ISIS, this performance gap has been substantially narrowed. Although
it is still significant, the choice of multicast primitive is
no longer the most important
determinant of performance in a multicast application.
Today, most of the cost of actually doing one of these multicasts is determined
by the way that data is shipped from the sending process to its destinations.
This is a problem over which even a novice ISIS user can have
considerable influence without understanding a difficult theory.
Basically, ISIS can do a multicast in either of two ways.
A {\em bypass} multicast is done by sending messages directly
from the process that does the multicast to its destinations.
If a process does a rapid series of $n$
{\em bypass} multicasts to $k$ destination processes, ISIS will
normally send roughly $n k$ messages to move the data, and perhaps
as few as $n$ if ethernet multicast can be exploited (even fewer
if piggybacking is possible).
A non-bypass multicast is done by first issuing an RPC from the
application process to the local {\tt protos} program. {\tt protos}
does the multicast as a proxy of the client, using a protocol that
is normally similar in cost to the {\em bypass} scheme, although not
invariably so.
On reaching its destination, such a message is then passed up to the
destination process using an IPC from the local {\tt protos} process.
Not surprisingly, even if the basic {\tt protos-protos} protocol is
comparable in speed to the one used for bypass communication,
this overall scheme is more costly. In fact, the cost of non-bypass
communication is from 5 to 10 times that of bypass communication,
regardless of the primitive used.
Thus, even the novice ISIS user will want to understand how to convince ISIS
to use bypass communication.
The rule is a fairly simple one.
In ISIS, a multicast can be sent to almost any list of
destinations.
A {\em destination list} could be simple: a single process or
process group. Or, it could be quite complex,
listing several processes by their ``private'' addresses, or even mixing
a bunch of addresses that include both processes and groups.
Obviously, most applications will use fairly simple forms of addressing.
This is fortunate, because ISIS is only able to use
the bypass protocols for messages with a {\em single}
destination, which can either be a single process, or a single process
group (identified by its logical address), or a process list
(identified by a logical address that ISIS assigns when the list is
created).
Moreover, in the case of a message to a single process,
ISIS will only use the bypass mode if that process is a {\em member of
a process group to which the sender belongs or a client of such a group}.
In the case of a
process group or a process list, a generalization of this rule applies:
the sender must be {\em a member of the group or a member of the parent
group associated with the list}.
If these conditions are not satisfied, ISIS cannot use the
bypass mode. It will automatically switch to the
slower, non-bypass protocols.
Clearly, it is a good idea to design applications so that a process
will often be able to multicast to a process group or list to which it
belongs (or a group of which it is a client (see {\tt pg\_client})),
\index{\tt pg\_client}
and containing more
or less exactly the processes with which it needs to interact!
This is one reason for using large numbers of process groups, or of process
lists, in ISIS: such an approach may make it possible to benefit from the
huge speedup of the bypass communication protocols.
Say that you don't have much choice, and are looking at a situation
where
the non-bypass protocols will definitely be used.
Even here, there are several performance
issues to keep in mind. As an example, say that you design a
directory service whose clients issue lookup and update requests.
Perhaps it will be possible to create a process group containing both clients
and the service, but often this simply wouldn't make sense.
Thus, non-bypass communication may be unavoidable.
One question relates to how many of the directory servers are involved
in each request. Although ISIS is good at multicasting, it clearly
doesn't make sense to assume that all servers see every single operation.
In most ISIS applications built at Cornell, we arrange for groups like this
to either assign a subgroup of processes to each client, or perhaps assign
a single server to each client, so that most interactions with the client
are either broadcasts to small numbers of processes or point-to-point
interactions with a single process. This is
an approach that reduces tolerance to failures,
however, so some care is needed when using it.
A second point to keep in mind relates to the notion of being a
{\em client of a process group}.
We have used this term quite loosely so far, in a way that could
refer to almost any interaction in which a process makes some request on
a group of processes that provide almost any sort of service.
ISIS is actually quite flexible about communication with groups, and
allows this type of interaction to occur without any special
actions in advance of the first communication.
However, there are situations in which a little advance work can speed up
subsequent interactions between a client
and a server group.
Moreover, if security is a concern, the same mechanism used in this
connection can also be used to enforce restrictions on
the processes that can talk to a specific group.
The tool used for these purposes is the {\tt pg\_client}
\index{\tt pg\_client}
system call. {\tt pg\_client} registers a process with a group
in a way that has fairly high cost at the time the
{\tt pg\_client} call is first performed, but substantially reduces the
cost of subsequent multicasts between the client and the group as a whole
(the cost of point-to-point communication is completely unaffected).
{\tt pg\_client} is somewhat like opening a private channel to the
group: it creates an optimized
communication pathway that ISIS can use when the client sends
subsequent messages to the process group.
Lacking this channel, ISIS has to pay a higher cost on each multicast
because of
issues relating to the way that multicasts are
synchronized with group membership changes.
We noted that {\tt pg\_client} is also used for security purposes.
ISIS allows applications to {\em filter} incoming messages so as to reject
messages from unauthorized senders.
At the time of a {\tt pg\_client} call, the application
can also authenticate the
client as a legitimate user of the group, for example using a password
or some other form of public credentials.
The approach yields a comparatively secure
channel between client and group.
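As a minimal sketch (the group name and credential are invented for this
example, and {\tt pg\_client} is assumed here to take the group address and
a credential string, as the discussion above suggests):
\begin{verbatim}
/* Sketch: "dirsvc" and the credential string are illustrative.       */
address *gaddr;

gaddr = pg_lookup("dirsvc");
pg_client(gaddr, "some-credential");  /* one-time, relatively costly  */

/* Subsequent multicasts from this process to the group can now use   */
/* the optimized channel, and the group members can have checked the  */
/* credential before admitting this client.                           */
\end{verbatim}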
Although there isn't much more that can be done to speed up calls {\em to}
a service, there are ways you can improve the performance of the service
itself.
The issue here, however, requires some understanding of when one can
substitute {\tt fbcast} for {\tt cbcast}, or {\tt cbcast} for {\tt abcast}.
Most operations that are multicast to an entire process group will expect
replies from one or more group members.
In this case, ISIS uses the {\tt cbcast} protocol by default.
It turns out that many servers do not need {\tt cbcast} guarantees at the
level of their interface to clients, and that the use of {\tt cbcast} here
can be a big source of overhead when large numbers of clients communicate
with the same server rapidly. In our own work on ISIS, we always try
to design servers so that replies to clients can be sent using {\tt fbcast},
a cheaper protocol (one does this using {\tt reply\_l} with the {\tt f}
option specified in the options string).
Substituting {\tt fbcast} for {\tt cbcast} in this situation is almost always safe,
but to be certain you will need to understand exactly how {\tt cbcast}
and {\tt fbcast} differ.
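As a sketch of this idea (the entry routine and result are illustrative,
and {\tt reply\_l} is assumed here to take the options string as its second
argument), a server entry might answer a client as follows; the only change
relative to a plain {\tt reply} is the {\tt f} option:
\begin{verbatim}
/* Illustrative server entry.  The lookup itself is elided; "value"   */
/* stands for whatever result it produced.  reply_l with option "f"   */
/* asks ISIS to send the reply using fbcast rather than the default   */
/* cbcast.                                                            */
lookup_entry(mp)
  message *mp;
{
    int value;
    /* ... perform the lookup, leaving the result in value ... */
    reply_l(mp, "f", "%d", value);
}
\end{verbatim}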
Similarly, within a service there may be costs associated with multicasts
from one server process to the remainder of the group. These
are done using {\tt abcast} by default, but can often be done using
{\tt cbcast}. Now, as it happens, the bypass protocols will normally
be used in such cases, and bypass {\tt cbcast} is not so much faster
than bypass {\tt abcast} as to justify a big investment of effort to
replace one with the other.
However, there may be situations where the performance really is a
critical bottleneck and where using {\tt cbcast} can be a big win.
We discuss both of these issues in Chapter \ref{Ch:order}.
To summarize our overall point, performance in a system like ISIS is
only sometimes dependent on the choice of multicast primitive.
The more significant issue is the overall application architecture;
only if this is well-matched to the needs of the application will
the multicast primitive become a dominant concern. Of course, to
get good performance in a sophisticated setting, it may be
necessary to override the defaults and specify the cheapest primitive that
satisfies your application requirements.
Fortunately, this is almost never the first issue that needs to
be addressed, and may never even arise in simple or small-scale applications.
To a first approximation, the defaults should be quite satisfactory and
perform well.
The defaults used by ISIS have the
advantage of being safe---when using them, one
never sees non virtually-synchronous executions.
For example, when building a server, if you do decide to
substitute {\tt fbcast}
for {\tt cbcast}, you will probably see a performance gain,
but you may also have introduced a subtle sort of
race condition that {\tt cbcast} prevents but {\tt fbcast} allows.
It is the obligation of the
application architect to verify that this is a safe substitution to make
in a particular setting.
In settings like the specific one cited above---a server with
large numbers of different clients---the performance difference
can be dramatic, and a naive solution may simply not be
adequate.\footnote{
You may now be wondering how to decide in crude terms whether
it is worthwhile to explore this sort of change in your applications.
We suggest that you code using the defaults, set up a performance
test, and experiment with forcing replies to
be sent with {\tt fbcast} rather than {\tt cbcast}.
You can do this without worrying about whether it is safe; the
odds of the race condition that {\tt cbcast} prevents actually arising are normally
quite low (except when systems reconfigure due to failures).
If the change
makes an important difference to the performance of the overall system,
it may be worthwhile to try and determine rigorously if such
a substitution is a good investment of your
time. Doing so will require that you become comfortable with the
multicast ordering
properties of {\tt fbcast}, {\tt cbcast} and {\tt abcast}. On the other
hand, learning this material at some point during your experience with
ISIS is probably valuable. This knowledge makes it
fairly easy to design applications in which it is certain, from the
outset, that {\tt fbcast} can be safely used for replies and
{\tt cbcast} for other multicasts. }
Fortunately, such situations seem relatively rare, based on our initial
experience with ISIS at Cornell.
\section{Hierarchical process groups}
We noted earlier that
ISIS V2.0 is being extended at Cornell to include support for
building large process groups that are automatically subdivided into
smaller ones.
Lacking this facility, you will need to undertake such subdivision
as an explicit part of your application.
One unfortunate issue is that this may force you to {\em create}
very large groups in situations where the hierarchical group tool
would not pay the corresponding high cost.
Otherwise, however, this approach will enable you to migrate easily to
our tool when it becomes available in mid 1990.
It may be helpful to outline in general terms how this extension will work.
The basic idea is to map requests on a hierarchical group into
requests on specific component subgroups.
For example, using the hierarchical system,
a process that calls {\tt pg\_lookup}
to obtain the address of a hierarchical process group
will in fact be given the address of one of its leaf groups.
(Leaf groups may also have explicit symbolic names by which they can
be directly identified). The leaf group members perform requests
as proxies for the group as a whole, but with the ability to perform
large-group multicasts if necessary. Most ISIS tools are being
extended to operate correctly in a hierarchical group context.
Moreover, the facility is designed so that an application program
that works correctly as a single ISIS V2.0 process group can be
easily extended into a hierarchical group, with minimal changes to
any existing code.
In the hierarchical facilities, processes in a service will not necessarily
have a complete list of other processes in the service; rather, they
will know about other processes in their subgroups, and interact with
the service as a whole only through the ``large group'' multicast interface.
You may want to keep this in mind when designing applications that
will later be scaled into large settings.
To understand why hierarchy can be such a big win, consider the
places where time is spent when a multicast is sent from one
sender to a large number of destinations, say 100 of them.
Unless there is some sort of hardware multicast or protocol
that ISIS can use, and this is very rare, such a multicast requires
100 separate packet transmissions.
Transmitting a packet is not a cheap operation: on SUN O/S it costs
about 3.4ms of compute time to send a 256 byte UDP packet on the SUN 3/60.
On the other hand, the medium (the ethernet) is only used for a tiny
fraction of this, perhaps .1ms or less. Thus, for 97\% of the time, the
ethernet is idle. It doesn't take much thinking to realize that by
having several processes transmit the multicast in parallel, one can
get nearly the full benefit of concurrency here. Sending to a
top-level group of 10 processes, each member of which forwards the message
to 10 others, will be almost 10 times as fast as sending the message
to 100 processes in succession (which would require roughly 0.34 seconds
in this example). The biggest win would come from using ethernet
multicast, of course: here, all 100 copies might be sent with a
single IO operation. However, few UNIX systems support ethernet
multicast drivers.
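A back-of-the-envelope version of this arithmetic, using the figures above
(the wire time is small enough to ignore):
\[
\mbox{sequential: } 100 \times 3.4\,\mbox{ms} \approx 340\,\mbox{ms},
\qquad
\mbox{two-level fan-out: } (10 + 10) \times 3.4\,\mbox{ms} \approx 68\,\mbox{ms},
\]
and the original sender itself is busy for only about $34$ms, which is where
the factor of ten comes from.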
This raises another point. Our group is currently working on
a package of fast multicast transport protocols implementing schemes
like this one. As these become available, they should assist
designers with large applications to benefit from structured solutions
and from the availability of special hardware. We hope to see this
problem ``solved'' to the best of our ability within the next few years.
\section{Tools level of ISIS}
The next few chapters of the manual are concerned with the
basic programming tools supported by ISIS, focusing on the level of
a single process group. These include tools for monitoring group
membership, distributing work and data across the members of a group,
and so forth. Chapter 1 included examples of how this layer of ISIS
looks. In the interest of brevity,
we omit a more detailed overview.
\section{Utility level of ISIS}
The next higher level of ISIS is the one providing such facilities as
the log manager, the recovery manager, the spooler and long-haul
interconnection mechanism, the transactional tools, NEWS, and so forth.
The problems dealt with by this level of abstraction
are primarily concerned with management of persistent data,
although one of the mechanisms (the spooler) doubles as a long-haul
communication facility.
We defer discussion of META and DECEIT to the next section, as these two
applications are substantial subsystems in and of themselves.
Likewise, we defer the issue of {\em wide area design} to Chapter \ref{Ch:wide},
and confine ourselves to a very brief summary of this problem in the
final section of the present chapter.
\begin{description}
\item{\bf Recovery manager.}
The recovery manager is used to build applications that restart
themselves automatically at a specified machine in the network.
You can also design systems that detect a failure, {\em pick} a machine
on which to restart some component, and do so, but not using this
facility. The section on META describes the recommended way of solving
this more general problem.
The recovery manager must be told what processes to restart at a
given machine and what arguments to pass them. It will perform this
task each time ISIS is rebooted on the designated machine.
\item{\bf Log manager.}
A {\em log} is a file within which a process running under ISIS stores
state checkpoints (generated much as in the state transfer example
of Chapter 1) and messages that caused the state to be updated.
By reloading a checkpoint and replaying these updates, a process should
be able to enter the same state that it was in at the time of a crash,
modulo the last operation or two, in the case where a crash prevented
the last log records from being written.
ISIS uses logs in a way that is associated strongly with a particular
process and a particular machine in the network. Logs are
an appropriate and inexpensive way to store state for these purposes.
Conversely, logs are an inappropriate way to save state information that will
be loaded by some other program {\em at a different location in the network}.
Thus, while you might use a log to build a reliable database server that
remembers its state across failures, the server would normally be bound
to a single machine where its log is stored.
You may have read articles describing the use of logs to
``migrate'' a process from one machine
to another.
In ISIS, one accomplishes such migration without
actually writing the process state into a file.
Instead, one represents the subsystem that will support migration as
a process group. In normal operation, this group
consists of a single member. During
migration, however, the new process joins the group and a state
transfer is done to copy the group state over;
the old process then leaves the group.
Logs are recorded in a directory private to each machine,
in files whose
names are based on the names of the
process groups to which a process belongs.
The decision to make a checkpoint within a log is automatic and depends
on its length and the number of update messages logged since the most
recent checkpoint.
\item{\bf Spooler.}
A spooler manages a sort of log but on a per-group basis and in a
way that is completely independent of machine names or individual
processes.
Like a log, spools typically consist of a checkpoint followed by
some number of messages. However, their intended use is quite
different.
A log is used to save the state of a process closely
associated with a particular place in the network and, presumably,
active whenever that machine is operational.
A spool, in contrast, is used as a sort of proxy for a process group
or service that is only available occasionally. The idea is that
the spooler accepts messages on behalf of such a group, which will
periodically show up, empty the spool of its contents, and then
go away again until whatever condition created a need for the service arises
again.
An example may help clarify this idea. Consider a system composed of
two ISIS installations that are physically separated and not able
to communicate easily with one another. Information is generated by
each machine that may be of value to the other, and is saved on each LAN
for eventual transmission.
Periodically, a line is opened between the two LAN's. Such a line
is costly and should be closed as soon as possible and kept busy
while it is open. This is accomplished by bringing the line up,
pumping all the spooled data
across in each direction, and shutting the connection down
soon afterwards.
Now, the process that opens a long-haul connection might not always
be on some single machine, nor will it necessarily be continuously
operational while the line is not needed. The application therefore
uses the spooler as a proxy that accepts and saves data for long-haul
transmission. The spooler interface is always the same and the spool
file is always available. Processes in LAN A that need to send
messages to LAN B simply hand them to the spooler which saves them
in a file keyed by the name of a process group that might not already
exist. When the line is opened, that group is created and triggers a
spool {\em replay}, causing these messages to be delivered in succession.
After successful transmission, they are deleted from the spool.
In fact, ISIS supports a long-haul communication facility that operates
in precisely this way. Details are given in Section~\ref{sec:widearea}
and in Chapter \ref{Ch:logging}.
To summarize, while a log supports the abstraction of a reliable process at
some designated place that never forgets its actions, a spool provides
the abstraction of a continuously available process group, even for
a service that isn't continuously running.
Spools are almost always used asynchronously: a process spooling a message
does not wait for replies, because it has no way to predict how long
the message may sit in the spool before being processed.
Logs have no such restriction: normally, the process sending to a
logged process would have no way to realize that its requests are being
logged. Because the process being logged is actively receiving the
messages it is free to reply using any of the ISIS mechanisms.
(During log replay, a flag is set so that the process replaying
the log can avoid replying to old messages; however, such replies would
be ignored in any case).
\item{\bf Transaction manager.}
Although the logging and spooling tools provide a useful form of
persistence, they are both oriented towards process groups that
maintain state {\em in memory}.
This is in contrast to a database model, in which the interesting
state is all stored on a disk and the in-core part of the database
is typically just a small cache of frequently accessed records.
When a database is updated, it is common to use a {\em transactional interface}
by which a series of reads and writes to one or more database files
are lumped into a single indivisible action. This has the
effect of isolating concurrently executing database users from seeing
one another's partial updates, and also provides some level of fault-tolerance
(partially completed transactions are typically erased at restart time).
Increasingly many operating systems include transactional file interfaces.
For example, in MACH the CAMELOT system supports transactions on files
and even shared virtual memory, and in the DEC VAX-Clusters system,
file locking and transactions are cited as one of the most powerful tools
offered to the programmer.
In contrast,
ISIS does not use a database transactional model in most situations.
Rather, ISIS assumes that some of its application processes may be
using transactional facilities, such as CAMELOT or VAX Clusters, and
provides tools to facilitate doing this in a virtually synchronous
process group context.
As it happens, these tools are powerful enough so that you can use them to
build transactional file access mechanisms directly from conventional
non-transactional files. However, our expectation is that these tools
would normally be used to glue together pieces of code that access
multiple transactional database or file systems at several places in
a network.
Few database tools allow this even if the exact same system is running
everywhere, and there are currently no ways to do this if the databases
and file systems are heterogeneous ones obtained from different vendors and
accessed through multiple language interfaces.
The ISIS transactional tool is thus used primarily to fill this gap.
\item{\bf NEWS.}
The NEWS facility is more of a persistent communication mechanism than
any of the other tools described above.
NEWS is oriented towards applications in which information is organized
by {\em subject}. In some situations, it will be convenient to structure
such a system as a set of process groups; all processes with an interest
in a particular subject would join the corresponding group, and the current
state of the group would be the currently valid
information that pertains to this subject.
However, there are applications which have this structure but where it
is not convenient to assume that the processes involved can easily
form a long-lived group.
NEWS is basically a subject-oriented message
storage repository. Processes can replay and post messages under a specified
subject; messages are deleted when they become ``stale'' under rules
specified by the posting process. The messages are saved in disk files,
typically on the machine where a posting was done.
Thus, although the NEWS, spool, and logging facilities overlap in some ways,
each is oriented towards a different form of persistence, and
encourages a different programming style.
\end{description}
\section{META and DECEIT}
META and DECEIT are two large ISIS applications.
META is concerned with instrumenting a distributed system
that contains sensors (or variables whose values are worth monitoring),
actuators (entities that can take actions on request), and triggers
(expressions that designate a condition, detectable by monitoring sensors,
under which an action should be taken, expressed using actuator calls). Sensors and actuators are realtime entities, and a big part of META
is concerned with obtaining this data in a timely fashion and
saving it in an efficient database-like format from which queries and
triggers can be evaluated. The remainder of META is an interpreter
for a sophisticated query language and a set of interfaces to facilities
that include the logic language PROLOG and a relational language that
resembles QUEL.
META is designed to integrate seamlessly into ISIS, and includes a
higher level graphical interface for building monitoring environments
and hooking them into your applications. These are described in
Appendix \ref{Ch:meta}.
DECEIT is a file system built using ISIS and the NFS file system standard.
Whereas NFS provides remote access to a single file system at a time,
DECEIT offers transparent file replication and fault-tolerance, with
variable levels of replication depending on the access patterns it senses.
DECEIT is not yet available from Cornell, but should be included as part of
the ISIS system late in 1990.
Among other uses of the system, DECEIT will permit ISIS users to replace the
so-called YP service with a file-system based service giving an
identical interface, but with far higher reliability, better
scaling properties, and nearly instantaneous updates.
\section{Wide area programming}
\label{sec:widearea}
Early in this chapter we observed that the problem of scale extends
beyond any facilities that ISIS can provide transparently.
However, although the basic toolkit cannot be used
in wide-area settings, ISIS does include
other mechanisms for building
applications that span long-haul links. Typically, such
applications consist of representative process groups on each of a set of
ISIS local area network systems, communicating with one another through
the spooler and long-haul facility.
This facility is modeled after the local area version of ISIS: it
includes a way to do asynchronous {\tt cbcast} and {\tt abcast} operations
and to define wide-area analogs of process groups.
It also includes a high speed file transfer mechanism---sort
of a program level version of UUCP.
The major difference between ISIS in a local area setting and in
wide-area networks is that
whereas many local operations are synchronous and require close cooperation
between processes,
wide-area applications generally run asynchronously and are designed to
avoid tightly coupled mechanisms.
This is because long-haul
links are typically established infrequently, used in a bursty manner, and
then closed down again (to minimize communications charges).
Long-haul applications require
a different style of programming than ISIS uses in
a local area setting---one that is loosely coupled and where
applications run as autonomously as possible.
This issue is explored in detail in Chapter \ref{Ch:wide}, where the ISIS
long-haul facilities are presented and examples of their use are given.
\section{What's missing?}
We end this chapter by asking what
aspects of ISIS are {\em currently} missing, and
what issues have yet to be addressed.
Above, we noted that ISIS V2.0 lacks some of the mechanisms needed
to build very large networks (hundreds of nodes) and very
large hierarchical process groups, and that some of the most exciting
high level ISIS applications are not yet available. We see this
as a temporary situation, since most of the software cited above
already exists in prototype form at Cornell, and it is just a matter of
time before it becomes solid enough to distribute more widely.
On the other hand, there are aspects of ISIS that have not
received adequate attention. An example of these is security.
The current security features of ISIS are better than nothing, but
quite weak. Another is the overall modularity of the system.
It would be nice to be able to pick and choose among ISIS features, or
to combine ISIS with a realtime protocol suite in a mix-and-match
manner. Currently, ISIS is too monolithic to be used in this manner,
and too large to run effectively on small dedicated machines.
Yet another problem is that the system assumes a ``crash failure'' model,
and could have real problems in environments where failures are less
benign.
Another important area that may seem to have been overlooked is that of
object-oriented programming and specification. In fact, ISIS fits
well into the so-called ANSA architecture, developed by a Cambridge-based
group called ISA. ANSA focuses on precisely these issues, and ISIS
is a part of the ISA testbed. If your organization is faced with
problems in this domain, we recommend that the ISA approach be considered.
Realtime is also a problem for the current ISIS system.
ISIS has been designed without concern for the special needs of realtime
control software. Integrating ISIS into such settings is an open problem.
It is unlikely that the present version of ISIS will ever work well
in settings with strong realtime performance constraints that require
fine-grained scheduling, but we are
hopeful that as ISIS becomes faster and faster, it
will prove increasingly valuable in less stringent applications.
Exploitation of physical parallelism is also a weak area within ISIS.
Even when run on a multiprocessor, ISIS still executes
tasks in a non-preemptive
way because we communicate the virtual synchrony properties of the
environment to the application through this ordering.
In fact, the manual section on lightweight tasks explains how this
feature can be disabled with relative ease. However, this is
far from a complete solution and not a fully satisfactory one.
As it happens, this
particular problem can be overcome in a somewhat awkward way,
because ISIS allows threads to exit the ISIS ``scope'' and re-enter
them, a topic considered in Section \ref{Sec:threads}.
The result is that one can spin off tasks that run concurrently with ISIS,
but become re-serialized each time they call an ISIS system routine.
Nonetheless, such a solution is less than satisfactory because it
forces the programmer to be quite knowledgeable about how the ISIS
threads package operates.
A more serious examination of these issues will be needed if ISIS is
to provide the sort of value in a parallel
environment that it does in a loosely coupled networking setting.
At Cornell, we believe that ISIS has fundamental advantages that outweigh its
generally more ``practical'' disadvantages.
That is, we see little evidence that our approach has fundamental
disadvantages that rule out its use in the sorts of settings for which
the system is intended.
Many of the disadvantages that we perceive as most serious
are yielding to ongoing research and development by our research group.
For example, performance was a major problem in versions of ISIS predating
ISIS V2.0, but has now been largely eliminated as a system-level problem.
Certainly, building an application to have the best possible performance
characteristics remains a challenging problem, but ISIS now offers the
mechanisms needed to solve this problem.
We see scaling as a similar problem: this has not been addressed in
a fully satisfactory way at the time of this writing, but our
group has an approach that seems likely to yield a satisfactory solution
in the near term.
It may take time, but we believe that ISIS can be extended into a
more and more comprehensive system. Feedback from users has been a
valuable incentive to our work, and any comments you might care to
direct to us (or to post to the newsgroup comp.sys.isis) would certainly
be appreciated.
% abridged version of isis manual
\documentstyle[11pt]{report}
\makeindex
\title{The ISIS System Manual, Version 1.2}
\author{K. Birman, R. Cooper, T. Joseph, K. Kane and F. Schmuck \\ \\
\copyright 1989 by The ISIS Project \footnote
{The ISIS system was developed with support from the Department of Defense
Advanced Research Projects Agency, DARPA, under ARPA order 5378, contract MDA903-85-C-0124 and
ARPA order 6037, contract N00140-87-C-8904. The information contained in this manual
is unclassified.
}
}
% Redefine @ as the subscript char, and _ as a plain underscore?
%\renewcommand{\tt}{
%\catcode`_=\active \newcommand{_}{\makebox[1.4mm][l]{\_}}
\newcommand{\newtextwidth}[3]{
\setlength{\textwidth}{#1}
\setlength{\columnwidth}{#1}
\setlength{\hsize}{#1}
\setlength{\linewidth}{#1}
\setlength{\oddsidemargin}{#2}
\setlength{\evensidemargin}{#3}
}
\newcommand{\hrf}{\ \hrulefill\ }
\newcommand{\Marginpar}[1]{\marginpar[\raggedright #1]{\raggedleft #1}}
\begin{document}
\clearpage
\setcounter{page}{1}
\pagenumbering{arabic}
\clearpage
\markboth
{\hrf \bf Index}
{\bf Index\hrf}
\input{index}
\clearpage
\pagestyle{myheadings}
\chapter{Basic Facilities}
\markboth
{\hrf \rm Basic Facilities\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Basic Facilities\hrf}
\input{basic}
\clearpage
\chapter{More About Messages}
\markboth
{\hrf \rm More About Messages\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm More About Messages\hrf}
\input{messages}
\clearpage
\chapter{The Lightweight Task Subsystem}
\markboth
{\hrf \rm The Lightweight Task Subsystem\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm The Lightweight Task Subsystem\hrf}
\input{tasks}
\clearpage
\chapter{The Broadcast Interface}
\markboth
{\hrf \rm The Broadcast Interface\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm The Broadcast Interface\hrf}
\input{bcinterface}
\clearpage
\chapter{The Logging Facility}
\markboth
{\hrf \rm The Logging Facility\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm The Logging Facility\hrf}
\input{logtool}
\clearpage
\chapter{Broadcast Types and Order}
\markboth
{\hrf \rm Broadcast Types and Order\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Broadcast Types and Order\hrf}
\input{types}
\clearpage
\chapter{Advanced Facilities}
\markboth
{\hrf \rm Advanced Facilities\hrf \bf Chapter \thechapter}
{\bf Chapter \thechapter\hrf \rm Advanced Facilities\hrf}
\section{Building large process groups}
\input{large}
\section{Signals}
\input {sigs}
\section{Forking off a child from within ISIS}
\input{fork}
\section{Interacting with files and devices}
\input{control}
\section{The recovery manager}
\input{rmgr}
\section{Cmd---the interactive ISIS control program}
\input{cmd}
\section{Creating and interpreting client dumps}
\input {cdump}
\section{Creating and interpreting protocol process dumps}
\input {pdump}
\section{Load monitoring utility}
\input {prstat}
\section{How the system behaves under heavy load}
\input {oload}
\section{Defining new ISIS transport protocols (experimental feature) }
\input {transport}
\appendix
\clearpage
\chapter{Quick Reference}
\newtextwidth{6in}{0.25in}{0.25in}
\markboth
{\hrf \rm Quick Reference\hrf \bf Appendix \thechapter}
{\bf Appendix \thechapter\hrf \rm Quick Reference\hrf}
\input{quick}
\end{document}
\section{Messages}
A message is simply a receptacle for carrying typed data from one\index{messages}
process to another.
Typically a process creates a message, puts data into it, and uses the ISIS
broadcast mechanism to have copies of the message delivered to one or more
processes.
If the representation of a data type varies from one machine to another
(e.g. some machines store integers with the high-order byte first, others
with the high-order byte last), ISIS ensures that appropriate conversions
are made.
This section describes how messages are created and deleted, how data are
put into and read out of messages, and how to perform other operations on
messages.
It covers enough for most normal applications, but for advanced message
handling routines see Chapter 5.
\begin{func}{Create a new message}
\begin{verbatim}
message *
msg_newmsg()
\end{verbatim}
\end{func}
\index{\tt msg\_newmsg}\index{messages, creating initially empty}
\vspace{1ex}
{\tt msg\_newmsg} returns a pointer to a structure of type {\tt message}
which can be used in other operations to access the newly created message.
A pointer of this type will be called a {\em message pointer}.
\begin{func}{Copy data into a message}
\begin{verbatim}
msg_put (msg_p, formatstr, arg1, arg2, ...)
message *msg_p;
char *formatstr;
\end{verbatim}
\end{func}
\index{\tt msg\_put}\index{messages, copy data into using format}
\begin{defs}
\defn{msg\_p} Message pointer.
\defn{formatstr} Format string giving the number and type of the arguments that
follow it (see below).
\end{defs}
The interface for {\tt msg\_put} is similar to that of the
function {\tt fprintf}.
The arguments {\tt arg1, arg2, \ldots} are copied into the message.
Multiple calls to {\tt msg\_put} have the same effect as a single call
with all the arguments concatenated together, that is successive calls
result in the new data being stored after the data already in the message.
\begin{func}{Create a message and copy data into it}
\begin{verbatim}
message *
msg_gen (formatstr, arg1, arg2, ...)
char *formatstr;
\end{verbatim}
\end{func}
\index{\tt msg\_gen}\index{messages, create using format}
\begin{defs}
\defn{formatstr} Format string giving the number and type of the arguments that
follow it (see below).
\end{defs}
{\tt msg\_gen} is a shorthand for a call to {\tt msg\_newmsg} followed
by a call to {\tt msg\_put}.
It returns a pointer to a message into which data has already been copied.
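For example (a minimal sketch), the following creates a message containing
an integer and a string, then deletes it (see {\tt msg\_delete}, below) when
it is no longer needed:
\begin{verbatim}
message *mp;

mp = msg_gen("count %d, name %s", 10, "frog");
/* ... use or transmit the message ... */
msg_delete(mp);      /* the creator is responsible for the delete */
\end{verbatim}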
\begin{func}{Read data out of a message}
\begin{verbatim}
int
msg_get (msg_p, formatstr, arg1, arg2, ...)
message *msg_p;
char *formatstr;
\end{verbatim}
\end{func}
\index{\tt msg\_get}\index{messages, copy data out using format}
\begin{defs}
\defn{msg\_p} Message pointer.
\defn{formatstr} Format string giving the number and type of the arguments that
follow it (see below).
\end{defs}
This function resembles {\tt fscanf}.
Each successive call to {\tt msg\_get} begins by reading the item after
the last one read.
It returns the number of items read (0 if there are no more items).
It aborts the scan and returns -1, setting {\tt isis\_errno} to {\tt IE\_MISMATCH},
if the type of the item in the message
fails to match the type of the item specified in {\tt formatstr}.
\begin{func}{Set up a message to be re-read}
\begin{verbatim}
msg_rewind (msg_p)
message *msg_p;
\end{verbatim}
\end{func}
\index{\tt msg\_rewind}\index{messages, prepare to rescan}
\begin{defs}
\defn{msg\_p} Message pointer.
\end{defs}
You may wish to re-read the contents of a message from the
beginning after one or more calls to {\tt msg\_get} have been made.
{\tt msg\_rewind} sets the ``read indicator''
back to the beginning of a message, so that the next call to {\tt msg\_get}
will start with the first item.
Calls to {\tt msg\_put} are unaffected by {\tt msg\_rewind}.
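A minimal sketch of rescanning a message:
\begin{verbatim}
message *mp = msg_gen("%d %d", 1, 2);
int a, b;

msg_get(mp, "%d %d", &a, &b);    /* reads 1 and 2                  */
msg_rewind(mp);                  /* reset the read indicator       */
msg_get(mp, "%d", &a);           /* reads the first item once more */
msg_delete(mp);
\end{verbatim}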
\begin{func}{Release the space allocated for a message}
\begin{verbatim}
msg_delete (msg_p)
message *msg_p;
\end{verbatim}
\end{func}
\index{\tt msg\_delete}\index{messages, deleting}
\begin{defs}
\defn{msg\_p} Message pointer.
\end{defs}
{\tt msg\_delete} releases the space allocated for a message,
unless you have incremented its reference count (see below), in which case
it decrements the reference count.
When the reference count drops down to its initial value, the space is
released.
The following rules explain when to call {\tt msg\_delete}.
\index{messages, deleting}
\begin{itemize}
\item You {\bf must} call {\tt msg\_delete} for any message you create by
calling {\tt msg\_newmsg} or {\tt msg\_gen}.
\item You {\bf must} call {\tt msg\_delete} for any messages you obtain
using the {\tt \%m} format item in a call to {\tt msg\_get}.
\item You {\bf must not} call {\tt msg\_delete} for any message delivered
to you as an argument to a task.
The system automatically deletes such messages when the task terminates.
\item You {\bf must} call {\tt msg\_delete} once for every call to {\tt
msg\_increfcount} (see below).
\end{itemize}
If {\tt msg\_delete} is called with a bad argument or called too many times on the
same message, ISIS produces an error message about a message {\tt block\_desc}
having an incorrect reference count.
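As a minimal sketch of the first two rules (the field contents are arbitrary):
\begin{verbatim}
int n;
message *outer, *inner, *extracted;

inner = msg_gen("%d", 17);
outer = msg_gen("%d %m", 1, inner);

msg_get(outer, "%d %m", &n, &extracted);
/* ... use the extracted message ... */
msg_delete(extracted);   /* obtained via a %m format item          */
msg_delete(outer);       /* created with msg_gen                   */
msg_delete(inner);       /* also created with msg_gen              */
\end{verbatim}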
\begin{func}{Increment the reference count of a message}
\begin{verbatim}
msg_increfcount (msg_p)
message *msg_p;
\end{verbatim}
\end{func}
\index{\tt msg\_increfcount}\index{messages, increment reference count}
\begin{defs}
\defn{msg\_p} Message pointer.
\end{defs}
The system automatically deletes messages delivered as the argument of a
task when the task terminates.
If you wish to retain a message beyond
task termination, you can increment its reference count.
The message will then be retained until {\tt msg\_delete}
is called explicitly once for
each increment of the reference count.
\begin{func}{Print brief description of a message}
\begin{verbatim}
pmsg (msg_p)
message *msg_p;
\end{verbatim}
\end{func}
\index{\tt pmsg}\index{messages, printing contents}
\begin{defs}
\defn{msg\_p} Message pointer.
\end{defs}
This routine prints the header of a message, giving its size in
bytes, the current reference count, the sender and destinations (if any),
and the message-id, if known.
If the message is actually contained within another message, it may be
identified as a PSEUDO-MSG, but the distinction is a purely technical one.
\begin{func}{Print detailed description of a message}
\begin{verbatim}
msg_printaccess (msg_p)
message *msg_p;
\end{verbatim}
\end{func}
\index{\tt msg\_printaccess}\index{messages, printing contents}
\begin{defs}
\defn{msg\_p} Message pointer.
\end{defs}
This routine prints a detailed description of a message, listing each
of its fields by name, type, length and address of contents, and recursively
printing any messages contained in the argument message.
Also shown are the data blocks within which the message is stored and the
IO vectors to be used in transmitting this message, if these have been
computed. A great deal of information is presented, but some will be
too technical for the typical reader to make sense out of.
\input{format}
\input{msg_examples}
In the introduction, we saw some examples of how the message facility is used.
This section treats several message routines more systematically.
\index{messages}
In most systems, a message is just a collection of bytes---a text
string, or some other simple data structure that can be transmitted from
place to place.
ISIS has a more sophisticated notion of a message: a list of ``fields''
which each have a name, type and value.
One can add fields to a message, look up a field in a message, etc.
Knowing the type of a field, ISIS can automatically convert its representation
from the one used by the sending machine to the one used on a destination
machine; thus, an integer put into a message on a VAX can be
extracted and used on a SUN without any special precautions or byte-swapping.
A simple, {\tt printf/scanf} style of interface is provided, together with
lower-level routines used internally by the system and available to the
user for use in special situations.
A message physically lives within the
address space of the process that created it or read it from an input channel.\footnote{
When ISIS runs under MACH, it uses MACH's virtual IO features in a way
that leads to copy-on-write message sharing when the same message is
needed in several processes on a single machine. However, this
mechanism is only supported on MACH at present.}
However, because ISIS supports multiple lightweight tasks within
a single address space, situations in which a message must be shared
between several concurrent tasks often arise.
A reference
count scheme is therefore used to keep track of the number of
independent pointers to a given message.
This feature is also used when pointers to a message are to be
placed on several data structures within a process.
ISIS implements the rule that a message will be deleted, and the
storage it occupied reclaimed, only when no references to it remain---that
is, when its reference count is decremented to 0.
The message data structures used by ISIS are complicated, and users
should access them only through the routines described below.
\section {Creating and deleting a message}
To create a message, the user would normally call the routine
\begin{verbatim}
message *msg_newmsg();
\end{verbatim}
\index{messages, creating initially empty}\index{\tt msg\_newmsg}
This function creates an empty message (one containing no fields)
and returns a pointer to it.
The reference count of the message is initially 1.
A second way of creating a message is to use {\tt msg\_gen}, as described
below; this fills the message with data at the same time.
The creator of a message is responsible for deleting it later:
\begin{verbatim}
msg_delete(mp)
message *mp;
\end{verbatim}
\index{messages, deleting}\index{\tt msg\_delete}
This routine decrements the reference count of the message; if the result
is 0, the storage associated with the message is reclaimed.
To increment the reference count of a message one calls:
\begin{verbatim}
msg_increfcount(mp)
message *mp;
\end{verbatim}
\index{messages, increment reference count}\index{\tt msg\_increfcount}
One can also make a copy of a message, by calling
{\tt msg\_copy(mp)}.\index{\tt msg\_copy}
Such a copy can be changed independently from the original;
no real copying is done until the first such change actually occurs.
If the goal is not to modify the message,
{\tt msg\_increfcount}
provides a much cheaper way to accomplish essentially the same thing.
It is extremely important to pair calls of {\tt msg\_increfcount}
and {\tt msg\_delete}. Unpaired calls can result in freeing
a message twice (which will usually crash your process) or
a memory leak (which will cause your process to gradually become
more and more bloated until it creaks to a slow and horrible halt).
In the introduction to this manual,
we saw how ISIS calls a user-defined entry
each time a message is received by a process.
ISIS creates this message before calling the entry, and deletes it
as soon as it returns.
The user's code is not responsible for doing this delete,
nor can it prevent it from being done.
Thus, it would be necessary to use {\tt msg\_increfcount} in an
application that will save a received message for later use.
Were the message to be stored in several independently managed data
structures, {\tt msg\_increfcount} might well be called once per
data structure, and
{\tt msg\_delete} called each time the message is removed from one of
these structures.
\index{messages, deleting}
Several other message routines, described below,
create new messages: {\tt msg\_gen, msg\_getmsg, msg\_getmsgs,
msg\_read,} and {\tt msg\_fread}.
In each case, the code that did the create is responsible for doing a
delete when the message is no longer needed.
The following example illustrates the correct use of {\tt msg\_increfcount}
and {\tt msg\_delete}.
On receiving a message, a service determines that it cannot send a response
immediately.
This might arise, for example, in a matchmaker service that
sometimes gets requests for some resource before the process supporting
that resource registers itself with the service.
The service would have a choice: either the match request should block,
waiting for the provider to declare itself, or the request should
be saved somewhere reasonable.
The latter is a more efficient solution because tasks use up a lot of
memory, even when they are blocked.
So, the matchmaker will do something like the following:
\index{{\em matchmaker} service (example)}
\begin{verbatim}
match_req(mp)
  message *mp;
{
    if (the request in mp can be matched right now)
    {
        reply(mp, "try talking to %A[1]", &who);
        return;
    }
    msg_increfcount(mp);
    save_request(mp);
}
\end{verbatim}
Later, the resource manager will finally show up:
\begin{verbatim}
resrc_register(mp)
  message *mp;
{
    while (smp = saved_request(...))
    {
        reply(smp, "try talking to %A[1]", &who);
        msg_delete(smp);
    }
}
\end{verbatim}
The idea is that the request message, which normally would have been
deleted when {\tt match\_req} finished, will now be saved
if necessary until the {\tt msg\_delete} is done in {\tt resrc\_register}.
\section{The msg\_put, msg\_gen, and msg\_get routines}
Most messages are manipulated using a pair of routines
patterned after the C library {\tt printf}: {\tt msg\_put} and
{\tt msg\_get}.
A call to {\tt msg\_put} looks like:
\begin{verbatim}
msg_put(mp, fmt, additional arguments ...)
message *mp;
char *fmt;
\end{verbatim}
\index{\tt msg\_put}\index{messages, copy data into using format}
\index{format strings}
\index{\tt msg\_gen}\index{messages, create using format}
{\tt msg\_gen} is just like {\tt msg\_put} except that it creates
a new message and returns a pointer to it.
The format string is composed of text, which will be
ignored, together with format items that look like {\tt \%x},
where {\tt \%x} is drawn from the list below.
The idea is that the text part can include information documenting
the format items, which are the only part of the format that
the {\tt msg\_get} and {\tt msg\_put} routines look at.
\index{\tt msg\_get}\index{messages, copy data out from using format}
Lower-case format items denote an argument passed by value: thus
{\tt \%d} denotes a 32-bit integer, and {\tt \%c} a single character.
Some lower-case format items are not permitted in {\tt msg\_put}, namely those
that would require passing of structures by value: {\tt \%a, \%b, \%e, \%p}.
This restriction is a recent one; it comes about because we found that by-value
argument passing wasn't portable and that in many cases the UNIX varargs
mechanism was unable to correctly extract such arguments from the stack.
There is no restriction on using formats like {\tt \%a} in calls to {\tt msg\_get}.
At the end of this section, we explain how a user can define
additional format items of his own, binding them to any of the
unused format characters.
Upper-case format items denote an argument consisting of a vector
of the corresponding base type; thus, {\tt \%D} and {\tt \%A} for vectors of
\index{array format items}
integers and addresses respectively.
In the case of a vector a {\tt length} argument is normally required immediately
after the vector address; this is always given in array elements, not bytes.
A short-hand is provided for cases where the length will be a simple constant (e.g. 1):
{\tt \%X[1]}. For example, the recommended way to put an address
into a message is using the format {\tt \%A[1]}. This requires a single
argument, namely the address of the object (in this case, the address) to be
copied into the message. In a call to {\tt msg\_get}, if the length is
specified using the {\tt \%X[n]} notation, no length pointer is expected (see
example, below).
For each format element, a new field is added to the message.
Other characters within the format string are ignored.
The format characters currently supported are:
\index{format items}
\begin{description}
\item [{\tt \%s}] The data is a null-terminated character string;
the argument is a pointer to it (see notes below).
\item [{\tt \%S}] Not currently supported.
\item [{\tt \%c}] The data is a single byte, passed by value.
\item [{\tt \%C}] The data is an array of bytes.
As described above,
the argument is given in two parts: a pointer to the beginning
of the array, and its length.
\item [{\tt \%d}] A 32-bit integer.
\item [{\tt \%D}] A vector of 32-bit integers.
\item [{\tt \%h}] A short (16-bit) integer.
\item [{\tt \%H}] A vector of short integers.
\item [{\tt \%l}] Same as {\tt \%d}.
\item [{\tt \%L}] Same as {\tt \%D}.
\item [{\tt \%m}] The data is a pointer to a message.
\item [{\tt \%M}] Not currently supported.
\item [{\tt \%f}] The data is a floating point (single precision) number.
\item [{\tt \%F}] The data is a vector of floating point (single precision) numbers.
\item [{\tt \%g}] The data is a floating point (double precision) number.
\item [{\tt \%G}] The data is a vector of floating point (double precision) numbers.
\item [{\tt \%a}] The data is an address of a process or process group.
\item [{\tt \%A}] The data is a vector of process or process group addresses.
If the vector is null-terminated, the number of elements in the vector
including the null address may be computed as {\tt alist\_len(vec) + 1}.
\item [{\tt \%e}] The data is an event-id (see bcast f option).
\item [{\tt \%E}] The data is a vector of event-ids.
\item[{\tt \%b}] A variable of type {\tt bitvec}.
\item[{\tt \%B}] An array of type {\tt bitvec}.
\item[{\tt \%p}] A process-group view.
\item[{\tt \%P}] An array of process-group views.
\end{description}
Notes:
Multiple calls to {\tt msg\_put} have the same effect as would a single
call using the concatenated format strings and argument lists.
The call returns $0$ normally, and $-1$ if an error is detected.
In these cases, {\tt isis\_errno} is set to indicate the nature of the error.
When sending large data objects, the format item {\tt \%*X} is
recommended. This works like {\tt \%X} but the data is not actually
copied into the message; instead ISIS keeps a pointer to it.
An additional argument is required: a callback routine that ISIS
should invoke when the indirect data reference is no longer being
saved in the message, presumably because the message has been
deleted for the last time.
The user must not change the contents of an indirectly referenced
data buffer until this callback occurs.
Information inserted using {\tt msg\_put} or {\tt msg\_gen} can be
extracted using {\tt msg\_get}:
\begin{verbatim}
nfound = msg_get(mp, fmt, additional arguments ...)
message *mp;
char *fmt;
\end{verbatim}
The message is scanned using the designated format,
and the number of format items extracted is returned.
Each subsequent call resumes where the previous one left off (to rescan
a message, call {\tt msg\_rewind(mp)}).
The error code -1 is returned if a data field does not match the
type of the corresponding format item, or an illegal format item is
discovered, or if the length of an array item is specified using the
{\tt \%X[n]} notation and the field in the message has a different length.
In these cases, scanning ceases and
{\tt isis\_errno} is set to indicate the nature of the error.
In many cases, it is possible to use exactly the same format string and
arguments in {\tt msg\_put} and {\tt msg\_get},
giving the addresses to store the data and the
corresponding lengths instead of the values for these arguments
(if the length is not desired, a null-pointer can be supplied instead).
However, this involves copying information from the message into
predefined vectors, and in some cases the application will not
know how long these vectors should be.
Consequently, two variants on the upper-case format items are
provided.
Items of the form {\tt \%-X} set user-supplied pointer variables to
point into the message itself, to the body of the appropriate field.
This involves no copying, but requires that the user be careful
to keep the message around as long as the pointer is in use.
Items of the form {\tt \%+X} cause the contents of the
\index{automatic {\tt malloc} using {\tt \%+X} format items}
corresponding fields to be copied into areas that are dynamically allocated
using {\tt malloc}, and set user-supplied pointer variables to
point to these areas.
These options are also supported for lower-case format items.
For example, {\tt \%s}: {\tt \%-s} gives a pointer
into the message, and {\tt \%+s} allocates a new copy.
In the cases where a {\tt malloc}
is done, the validity of the pointers is not tied to the continued
existence of the parent message, but the user is expected to
call {\tt free} when the pointer is no longer needed.
\index{pointer into message using {\tt \%-X} format items}
Zero-length data fields are signaled to the user by returning a null-valued pointer
in the case of a {\tt \%+x} or {\tt \%-x} format item.
{\em Note: }
A common source of errors relates to the fact that
\index{pointer into message using {\tt \%-X} format items}
{\tt \%s} will {\em copy} bytes from the message into an array that
the user is expected to have pre-allocated, much like {\tt \%C}.
If you have not pre-allocated space for the string (i.e. if the
string was declared to be of type {\tt char *str} rather
than {\tt char str[LEN]}), you
will want to use format item {\tt \%+s} or {\tt \%-s} and
specify the address of a pointer variable for the
corresponding argument.
For example, assume a user-program creates a message as follows:
\begin{verbatim}
gen_proc()
{
register message *mp = msg_newmsg();
int a, vec[10];
. . .
msg_put(mp, "%d,%D,%s", a, vec, 10, "frog");
. . .
}
\end{verbatim}
then the following could be used to extract these variables from the message:
\begin{verbatim}
unpack_proc(mp)
register message *mp;
{
int a, vec[10], veclen;
char *str;
. . .
(void)msg_get(mp, "%d,%D,%-s", &a, vec, &veclen, &str);
. . .
}
\end{verbatim}
After the {\tt msg\_get} call returns,
data will have been copied from the message into {\tt vec}.
{\tt str} will point to a string stored within
message {\tt mp}.
It would be incorrect to reference {\tt str} after this message has been deleted.
Notice that we were expected to know in advance that {\tt vec}
contained enough room for the vector of integers contained in the message.
Had {\tt vec} been too small, the copy would have overrun the array
with unpredictable consequences.
A version that stores a reference to {\tt vec} to avoid copying follows.
This would only be beneficial for a large vector of at least 256 bytes
of data:
\begin{verbatim}
gen_proc()
{
register message *mp = msg_newmsg();
int a, vec[1000], vec_freed();
. . .
msg_put(mp, "%d,%*D,%s", a, vec, 1000, vec_freed, "frog");
. . .
}
\end{verbatim}
ISIS will call {\tt vec\_freed(vec)} when message {\tt mp} is
deallocated; the contents of {\tt vec} must not be changed prior to
this callback.
A variant of the receive routine
that does no copying, and does not require advance knowledge
of the length of {\tt vec[]}, would be as follows:
\begin{verbatim}
unpack_proc(mp)
register message *mp;
{
int a, *vec, len;
char *str;
. . .
(void)msg_get(mp, "%d,%-D,%-s", &a, &vec, &len, &str);
. . .
}
\end{verbatim}
Here, {\tt vec} will be set to point into the message after the call to
{\tt msg\_get}, just as {\tt str} will point to the string stored there.
These pointers are meaningful only as long as the message has not been
deleted.
A third variant, using the {\tt malloc} feature, is as follows:
\begin{verbatim}
unpack_proc(mp)
register message *mp;
{
int a, *vec, len;
char *str;
. . .
(void)msg_get(mp, "%d,%+D,%+s", &a, &vec, &len, &str);
. . .
free(vec);
free(str);
}
\end{verbatim}
Here, {\tt free} is called to release the dynamically allocated storage.
Notice that {\tt \%*D} is a little different from {\tt \%+D} or
{\tt \%-D}: the former is used when the caller is managing
some sort of data buffers, and stores a pointer into the user's
buffer within the message; on reception the data will
be found in the message, but until transmission ISIS
avoids any copying.
The latter format items give explicit control
over whether ISIS copies on reception, and if so where it copies the
data to. Copying is completely avoided in applications that
use the {\tt \%*D} format on sending and the {\tt \%-D} format on reception.
Another common source of confusion stems from the fact that {\tt msg\_get}
supports the {\tt \%a} format item, but {\tt msg\_put} does not.
Intuitively, this should be easy to understand: in the case of {\tt msg\_get},
by-value argument passing is not being used, whereas {\tt msg\_put} would
interpret format {\tt \%a} to mean ``an address passed by value''.
As noted above, we found, somewhat painfully, that by-value argument
passing of this sort is simply not portable.
Here is a simple call to generate a message containing an address
and the corresponding {\tt msg\_get} calls:
\begin{verbatim}
addr_demo()
{
register message *mp;
address *addr_p = pg_lookup("group");
address addr_v;
mp = msg_gen("%A[1],%A[1]", addr_p, &my_address);
...
msg_get(mp, "%-a,%A[1]", &addr_p, &addr_v);
}
\end{verbatim}
\section{Message fields}
We mentioned above that each message is really a collection of fields.
Internally, each message field has an identifier number, a type, data, a length, and
an instance number.
A field id is an 8-bit integer; the system reserves the ``negative''
values for internal use, and the user is permitted to define
fields 1-127 for application-specific purposes.
The field id may not be 0.
For example, {\tt msg\_put} names the fields it creates {\tt SYSFLD\_SCAN},
which has a negative value.
Each format item corresponds to a new instance of this field.
A field type is used to tell ISIS how the data in the field is
represented, for purposes of byte-swapping when the message is sent
from one machine to another.
For example, the VAX and SUN computers order the bytes within an integer
differently in memory, hence an integer written by a VAX is not
directly readable on a SUN without re-ordering.
Each field is treated as an array of 1 or more elements of some base
type.
In addition to the basic types known by ISIS, corresponding to the
predefined format items,
a user can also define additional field types of his own.
The creation of new field types is discussed below.
The data part of a field, which may be null, consists of a series of
zero or more bytes copied from a location specified by the process
that created the field.
The length of a field gives the length, in bytes, of the data part.
For a string, this would include the null-byte that terminates the
string.
Other than the restriction that field id be in the range 1-127,
there is no limit on the number of instances of a field that a message
may contain, nor on the maximum size of a message.
One-time transmission of a message containing as much as several
megabytes of data should work correctly.
However, ISIS tends to choke when huge messages are sent
rapidly without waiting for acknowledgements.
A limit of about 8k bytes is recommended when
a series of messages is to be transmitted asynchronously.
A drawback of {\tt msg\_put} is that it creates a single
sequential list of fields within a message.
Layered applications may need to
store several orthogonal sets of information in a single message,
and hence need
the ability to control the field that {\tt msg\_put} and {\tt msg\_get} will
use.
There are variants of {\tt msg\_put} and {\tt msg\_get} for
this purpose.
To use them, the user picks the field id, in the range 1-127,
and then calls:
\begin{verbatim}
msg_putfld(mp, fid, fmt, args...)
\end{verbatim}
or, to retrieve from such a message
\begin{verbatim}
msg_getfld(mp, fid, &position, fmt, args...)
\end{verbatim}
\index{\tt msg\_putfld}\index{\tt msg\_getfld}\index{messages, fields}
The {\em position} argument is used by {\tt msg\_getfld} to
keep track of its position within the message, so that successive calls
will scan the message sequentially.
It can be specified as a null-pointer if only one call to {\tt msg\_getfld}
will be done; otherwise, it should point to an integer variable
in which this positional information can be stored.
The variable should be zero initially; the equivalent effect of
a {\tt msg\_rewind} can be had by zeroing this variable before calling {\tt msg\_getfld}.
\index{messages, scanning position}\index{messages, prepare to rescan}\index{\tt msg\_rewind}
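For example, a layered application might keep its own bookkeeping data
under a private field id, separate from the fields created by {\tt msg\_put}.
The following is only a sketch: the field id {\tt 5} and the routine names
are arbitrary choices for illustration, not part of the toolkit.
\begin{verbatim}
#include "isis.h"

#define MY_FLD  5       /* application-chosen field id, in the range 1-127 */

put_bookkeeping(mp, seq, flag)
    message *mp;
    int seq, flag;
{
    msg_putfld(mp, MY_FLD, "%d,%d", seq, flag);
}

get_bookkeeping(mp, seq_p, flag_p)
    message *mp;
    int *seq_p, *flag_p;
{
    int pos = 0;        /* zero initially; zeroing it again acts like msg_rewind */

    msg_getfld(mp, MY_FLD, &pos, "%d", seq_p);
    msg_getfld(mp, MY_FLD, &pos, "%d", flag_p); /* resumes where the first call left off */
}
\end{verbatim}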
It is possible to determine the type of a field using the call
\begin{verbatim}
type = msg_gettype(mp, fid, &position);
\end{verbatim}
\index{\tt msg\_gettype}\index{messages, fields}\index{\tt msg\_printaccess}
A list of the types, locations and lengths of the fields in a message
can be printed by calling
\begin{verbatim}
type = msg_printaccess(mp);
\end{verbatim}
\section{Defining new field types}
In most cases, it should be possible to create a format string
for transmitting even a complex data structure, field by field.
However, there are situations in which this becomes very inconvenient.
For example, it is hard to transmit an array of structure elements
using this approach.
ISIS permits the user to define new format items using a subroutine call
in which routines for doing any necessary byte swapping are specified
explicitly.
To define a new type call:
\index{messages, defining new format types}
\index{format items, defining new ones}
\index{\tt isis\_define\_type}
\begin{verbatim}
isis_define_type(formatletter, size, converter)
char formatletter;
int size;
int (*converter)();
\end{verbatim}
The interpretation of the arguments is as follows.
The format letter is the ascii character that will be used in format items
referring to this type; it must not redefine a previously used format item
({\tt a b c d e f h l m p s})
because the ISIS toolkit routines use most of them and would get confused by a change.
The size gives the length, in bytes, of an element of this type.
{\em Note: prior to V2.1, this routine took an additional argument, called
the {\tt typeno}. This argument is no longer supported.}
As for other sorts of structures, lower-case format items using a defined
type are illegal in calls to {\tt msg\_put}.
The converter routine is responsible for mapping a data item of the
specified type from the byte format of a sending machine into the
byte format of the receiving machine.
It does this by repeated calls to the built-in conversion routines listed
below:
\index{\tt msg\_convertchar}
\index{\tt msg\_convertshort}
\index{\tt msg\_convertlong}
\index{\tt msg\_convertaddress}
\index{\tt msg\_convertsiteid}
\index{\tt msg\_convertpgroup}
\begin{verbatim}
msg_convertchar (char_p)
msg_convertshort (short_p);
msg_convertlong (long_p);
msg_convertaddress (address_p)
msg_convertsiteid (site_id_p);
\end{verbatim}
The idea is that the converter will be called with a single argument, pointing
to the data item to byte swap.
It should coerce this to a pointer to a structure of the type it expects,
and then repeatedly call the built-in routines on the subfields of that
structure, passing the address of the data item to byte-swap.
The conversion is done with in-place byte swapping, hence it cannot
accommodate differences in the way structures are ``padded'' from machine
to machine.
The routine {\tt msg\_convertchar} is really a no-op since no conversion is
required for one-byte quantities; it is merely provided for completeness.
For example, given a data structure:
\begin{verbatim}
struct emp_record
{
char employee[8];
long empid;
long salary;
short age;
char sex;
... etc
};
\end{verbatim}
one could define the format item {\tt \%x} for this structure as follows:
\begin{verbatim}
isis_define_type('x', sizeof(struct emp_record), emp_conv);
\end{verbatim}
where the routine {\tt emp\_conv} is coded:
\begin{verbatim}
emp_conv(ep)
struct emp_record *ep;
{
register n;
for (n = 0; n < 8; n++)
msg_convertchar(&ep->employee[n]);
msg_convertlong(&ep->empid);
msg_convertlong(&ep->salary);
msg_convertshort(&ep->age);
msg_convertchar(&ep->sex);
}
\end{verbatim}
Having done this, {\tt msg\_put} and {\tt msg\_get} will
understand the format items {\tt \%x, \%X, \%+X, and \%-X.}
The calls to {\tt msg\_convertchar} can be omitted if desired, since this
routine is a no-op.
Notice that the employee record has been declared in a way that
makes padding unlikely.
Were padding necessary, for example to align long-integers onto 32-bit
boundaries, problems could arise when
transmitting these records to a machine that uses different padding rules.
The arriving structure would not be organized in the way expected by the
conversion routine, which would scramble the data instead of converting it
into the reception machine's format.
One way to handle structures that need padding is to encode and decode them
using a method such as the XDR scheme, which is awkward to use but very
general, and then tell ISIS to treat the encoded data just as a byte-string
of some length.
\index{XDR encodings, used within ISIS}
One funny thing about XDR streams is that they come in different
flavors. Some, like {\tt xdrstdio}, automagically write to files (or
sockets, of course), so short of reading the file there is no way to
capture the contents of the stream. There is an {\tt xdrrec} flavor that
lets the user supply the routines to be called on record boundaries,
so you could save up the output and then read it on the other end in
the same way. Finally, there is an {\tt xdrmem} flavor that writes to a
\index{{\tt xdrmem}, XDR stream used with memory object}
chunk of memory. That last form is the one to use if you
need to employ XDR within ISIS.
The idea is to generate XDR output inside a memory area, then put
the resulting encoded data into your message using
{\tt \%C}
or, to avoid copying,
{\tt \%*C}.
On reception, a pointer to the data can be
obtained using the {\tt \%+C} format.
It is easy to hack up an {\tt xdrmem\_resize},
\index{{\tt xdr\_resize}, change size of an XDR data object}
which you can use to enlarge the buffer when an XDR write fails.
If you are using {\tt xdrmem}
streams, {\tt xdr\_getpos} seems to return the size
\index{{\tt xdr\_getpos}, size of an XDR data object}
(in bytes) of the stream, although it is not clear
if this behavior is guaranteed for all possible XDR implementations.
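As a sketch of this approach (error handling omitted; the buffer size and
the use of {\tt xdr\_int} are merely illustrative), a message field could be
built on the sending side and decoded on the receiving side as follows:
\begin{verbatim}
#include <rpc/rpc.h>    /* Sun XDR routines */
#include "isis.h"

#define XBUFLEN 1024

send_xdr_example(mp, value)
    message *mp;
    int value;
{
    XDR xdrs;
    char buf[XBUFLEN];
    int len;

    xdrmem_create(&xdrs, buf, XBUFLEN, XDR_ENCODE);
    xdr_int(&xdrs, &value);             /* encode into the memory buffer */
    len = xdr_getpos(&xdrs);            /* number of encoded bytes */
    msg_put(mp, "%C", buf, len);        /* store the encoding as a byte string */
}

recv_xdr_example(mp, value_p)
    message *mp;
    int *value_p;
{
    XDR xdrs;
    char *data;
    int len;

    msg_get(mp, "%+C", &data, &len);    /* malloc'ed copy of the encoded bytes */
    xdrmem_create(&xdrs, data, len, XDR_DECODE);
    xdr_int(&xdrs, value_p);
    free(data);
}
\end{verbatim}
If the fixed buffer proves too small, a routine along the lines of the
{\tt xdrmem\_resize} hack mentioned above could be used to grow it.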
\section{Other useful message routines}
The sender of a message can be determined by calling
\begin{verbatim}
address *msg_getsender(mp)
message *mp;
\end{verbatim}
This information cannot be forged; although a routine {\tt msg\_setsender}
is defined, it has no effect if called by user-level code.
When a message may have been forwarded, using {\tt forward},
{\tt msg\_getsender} returns the address of the original sender of the
request.
Should it be necessary to obtain the address of the ``most recent'' sender
(possibly a process that called {\tt forward}), then {\tt msg\_gettruesender}
should be used:
\begin{verbatim}
address *msg_gettruesender(mp)
message *mp;
\end{verbatim}
\index{messages, determine most recent sender}\index{\tt msg\_gettruesender}
\index{messages, determine original sender}\index{\tt msg\_getsender}
\index{messages, determine if forwarded}\index{\tt msg\_isforwarded}
One can determine whether a message has been forwarded by calling
{\tt msg\_isforwarded(mp)}, which returns $0$ or $1$ accordingly.
Associated with each message is an identification number, stored as a long
integer, and accessible by calling
\begin{verbatim}
long msg_getid(mp)
message *mp;
\end{verbatim}
\index{messages, identification number}\index{\tt msg\_getid}
The pair {\tt (msg\_getsender(mp),msg\_getid(mp))} can be used as a unique
identifier for a given message.
Moreover, message identification numbers work like ``logical clocks,''
in that a process that has received a message with identification
number $n$ will subsequently generate messages with identification
numbers greater than $n$.
This property is used primarily by the logging facility, but you
can take advantage of it if desired.
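For instance, a process that wants to recognize a message it has already
handled could record this pair for each delivery. The structure below is
purely illustrative, and the {\tt memcmp} comparison assumes that the
{\tt address} structure contains no uninitialized padding.
\begin{verbatim}
#include <string.h>
#include "isis.h"

struct msg_key
{
    address sender;
    long    id;
};

/* Record the (sender, id) pair that uniquely identifies message mp */
make_key(mp, key)
    message *mp;
    struct msg_key *key;
{
    key->sender = *msg_getsender(mp);
    key->id = msg_getid(mp);
}

/* Two keys denote the same message if both parts match */
int
same_key(a, b)
    struct msg_key *a, *b;
{
    return a->id == b->id &&
           memcmp((char *) &a->sender, (char *) &b->sender, sizeof(address)) == 0;
}
\end{verbatim}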
Message id's also encode a small amount of information about the type of
message being sent.
A message for which someone may want a reply will have an odd-numbered
message id.
A message which was sent asynchronously will have an even-numbered id.
ISIS most often uses the identification number of a message in conjunction
with other information about it.
This data is collected into what is called an {\tt event\_id},
and is described in Chapter 7.
To summarize, each time a message is broadcast, a broadcast event id
is assigned to the broadcast.
This event id is available to the sender of the broadcast if the {\tt `t'}
option is used, and is also available to the recipients in the global variable
{\tt my\_eid}, which is of type {\tt event\_id}.
The destinations to which a message was sent are saved in the message
as a vector of addresses terminated by a NULLADDRESS.
A pointer to this vector is returned by
\begin{verbatim}
address *msg_getdests(mp)
message *mp;
\end{verbatim}
\index{messages, determine destinations}\index{\tt msg\_getdests}
The destination list returned by this function will contain process
addresses {\em with the entry numbers filled in.}
For example,
by scanning for the address(es) on which {\tt addr\_ismine(3)} returns TRUE
and checking the entry number, one can determine which entries
will receive a copy of a given message.
This is how ISIS determines which tasks to create upon reception of a message.
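A sketch of such a scan follows. It assumes that {\tt addr\_ismine} accepts
a pointer to the address being tested and that the entry number is kept in
an {\tt addr\_entry} field; check {\tt isis.h} for the exact names on your
installation.
\begin{verbatim}
#include <stdio.h>
#include "isis.h"

/* Print the entry numbers at which this process will receive message mp */
show_my_entries(mp)
    message *mp;
{
    address *dests = msg_getdests(mp);
    int i, n = alist_len(dests);

    for (i = 0; i < n; i++)
        if (addr_ismine(&dests[i]))
            printf("delivered to entry %d\n", dests[i].addr_entry);
}
\end{verbatim}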
The length in bytes of a message can be determined by calling
\begin{verbatim}
length = msg_getlen(mp)
message *mp;
\end{verbatim}
\index{messages, determine length}\index{\tt msg\_getlen}
\section{Input and output of messages}
It is possible to write messages to IO channels and streams, for
example to save them in a file:
\begin{verbatim}
outcome = msg_write(fdes, mp)
message *mp;
\end{verbatim}
\index{messages, write to file descriptor}\index{\tt msg\_write}
The outcome will be 0 if the message was successfully output to the
designated file descriptor, and -1 otherwise.
The UNIX errno variable will contain an error code in this case.
Similarly:
\begin{verbatim}
outcome = msg_fwrite(file, mp)
FILE *file;
message *mp;
\end{verbatim}
outputs the message to the named stream.
\index{messages, write to stream}\index{\tt msg\_fwrite}
To read messages from IO channels or streams:
\begin{verbatim}
message *msg_read(fdes)
\end{verbatim}
\index{messages, read from file}\index{\tt msg\_read}
\index{messages, read from stream}\index{\tt msg\_fread}
and
\begin{verbatim}
message *msg_fread(file)
FILE *file;
\end{verbatim}
Both routines return null pointers if an IO error occurred or
if the entity read from the file was not a message.
Note that these routines work only on message streams, not UDP
style packet connections. ISIS does not support a UDP level
message input/output mechanism.
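A minimal sketch of saving a message in a file and reading it back might
look as follows (the file name is arbitrary and error handling is kept to
a minimum):
\begin{verbatim}
#include <stdio.h>
#include <fcntl.h>
#include "isis.h"

save_and_restore(mp)
    message *mp;
{
    int fd;
    message *copy;

    fd = open("saved.msg", O_WRONLY|O_CREAT|O_TRUNC, 0644);
    if (fd < 0 || msg_write(fd, mp) == -1)
        perror("msg_write");
    close(fd);

    fd = open("saved.msg", O_RDONLY);
    copy = msg_read(fd);
    close(fd);
    if (copy == (message *) 0)
        printf("object read back was not a message\n");
    else
        msg_delete(copy);
}
\end{verbatim}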
{\em Warning: ISIS has been known to crash if it tries to read a message on
a channel and the object it reads in is not a message, or has been corrupted.
Although a reasonable attempt is made to verify that the object being
read from a channel is actually a message, this code is not
100\% foolproof. ISIS will, however, never crash when trying to read
a message that had been partially written at the
moment when a system crash occurred. }
\label{Ch:meta}
This chapter has not yet been written. The interested reader is
referred to the file ~isis/meta1.2/doc/fd.doc for details.
\section{Monitors and watches}
A process uses\index{process groups, monitoring membership}
\index{process groups, watching individual members}
\index{site views, monitoring}
monitors and watches to ask the ISIS system to inform it
when a particular event or type of event occurs.
When a process sets a monitor or a watch, it gives the system the name of a
routine and an argument for the routine.
When the event occurs, the named routine is run as a task within the
process, with the argument being passed as a parameter.
We say that the process has received a monitor or watch notification.
A watch is used for a one-time event like the termination of a specific
process, while a monitor is used for a class of events (e.g. to
be notified of all future changes to the membership of a particular process
group).
Once a watch event has triggered, it will never trigger a second time.
A monitor event will trigger repeatedly each time the event in question
occurs, until it is explicitly canceled.
All monitor or watch notifications for the same event form a single
distributed action that is ordered relative to all other monitor or watch
notifications and to broadcast message delivery.
In other words, all processes will receive
notifications in exactly the same order.
Also, it will never be the case that one process receives a broadcast
message {\em before} a particular notification, while
another process receives the same message {\em after} that notification.
In terms of our virtual synchrony model, then, watch or monitor
notifications are virtually synchronous distributed events.
\begin{func}{Monitor changes to the site view}
\begin{verbatim}
int
sv_monitor (routine, arg)
int (*routine)();
int arg;
\end{verbatim}
\end{func}
\index{site views, monitoring continuously}\index{\tt sv\_monitor}
\begin{defs}
\defn{routine} Routine to be called if the site view changes.
\defn{arg} Argument to be passed to {\tt routine}.
\end{defs}
If the site view changes (a site failure or recovery is detected by the
system), a task is started to execute the function call
\newline\hspace*{2em}\verb|routine (sview_p, arg);|\newline
{\tt sview\_p} is a pointer to the new site view (Section 4.1).
{\tt sv\_monitor} returns an integer monitor-id that can be used to cancel
the monitor.
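For example, the following sketch starts watching the site view and cancels
the monitor later using {\tt sv\_monitor\_cancel}, described next. The
declaration of the callback's first argument as {\tt char *} is only a
placeholder; the actual site view structure is defined in {\tt isis.h}
(Section 4.1).
\begin{verbatim}
#include <stdio.h>
#include "isis.h"

int sv_mid;                     /* monitor-id, saved for cancellation */

/* Run as a new task each time the site view changes */
view_changed(sview_p, arg)
    char *sview_p;              /* really a pointer to the new site view */
    int arg;
{
    printf("site view changed (arg %d)\n", arg);
}

start_site_monitor()
{
    sv_mid = sv_monitor(view_changed, 0);
}

stop_site_monitor()
{
    sv_monitor_cancel(sv_mid);
}
\end{verbatim}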
\begin{func}{Cancel a site view monitor}
\begin{verbatim}
sv_monitor_cancel (sv_monitor_id)
int sv_monitor_id;
\end{verbatim}
\end{func}
\index{site views, cancel monitor}\index{\tt sv\_monitor\_cancel}
\begin{defs}
\defn{sv\_monitor\_id} Monitor-id of monitor to cancel.
\end{defs}
This function returns $0$ in the normal case. It takes no action and returns
$-1$ if given a meaningless monitor-id or if the
monitor has already been canceled.
\begin{func}{Monitor changes to the membership of a process group}
\begin{verbatim}
int
pg_monitor (gaddr_p, routine, arg)
address *gaddr_p;
int (*routine)();
int arg;
\end{verbatim}
\end{func}
\index{process groups, monitoring membership}\index{\tt pg\_monitor}
\begin{defs}
\defn{gaddr\_p} Address of process group.
\defn{routine} Routine to be called if the group membership changes.
\defn{arg} Argument to be passed to {\tt routine}.
\end{defs}
If the group membership changes, a task is started to execute the function
call
\newline\hspace*{2em}\verb|routine (gview_p, arg);|\newline
{\tt gview\_p} is a pointer to the group view data structure (Section 4.5)
for the named group.
{\tt pg\_monitor} returns an integer monitor-id that can be used to cancel
the monitor.
The value will be positive normally; -1 if the group was unknown or
the caller was not a member or client.
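A corresponding sketch for a group, showing the check on the return value;
as before, the callback's first argument is declared {\tt char *} only as a
placeholder for the group view structure of Section 4.5.
\begin{verbatim}
#include <stdio.h>
#include "isis.h"

/* Run as a new task each time the group membership changes */
members_changed(gview_p, arg)
    char *gview_p;              /* really a pointer to the group view */
    int arg;
{
    printf("group membership changed (arg %d)\n", arg);
}

int
start_group_monitor(gaddr_p)
    address *gaddr_p;
{
    int mid = pg_monitor(gaddr_p, members_changed, 0);

    if (mid == -1)
        printf("unknown group, or not a member or client\n");
    return mid;                 /* pass to pg_monitor_cancel later */
}
\end{verbatim}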
\begin{func}{Watch a group for total failure}
\begin{verbatim}
int
pg_detect_failure (gaddr_p, routine, arg)
address *gaddr_p;
int (*routine)();
int arg;
\end{verbatim}
\end{func}
\index{process groups, monitoring for total failure}\index{\tt pg\_detect\_failure}
\begin{defs}
\defn{gaddr\_p} Address of process group.
\defn{routine} Routine to be called if the group membership falls to zero
\defn{arg} Argument to be passed to {\tt routine}.
\end{defs}
This facility is normally used by a {\em non-member} (and non-client) of the
group.
For example, a process that has done a {\tt pg\_lookup} to get the
address of a group, but didn't join it or become a client, might use
{\tt pg\_detect\_failure} to learn that all members of the group have crashed.
If this occurs, a task is started to execute the function call
\newline\hspace*{2em}\verb|routine (gview_p, W_FAIL, arg);|\newline
{\tt gview\_p} is a pointer to the group view data structure (Section 4.5)
for the named group. {\tt W\_FAIL} is an {\tt int} indicating the
event that occurred.
{\tt pg\_detect\_failure} returns an integer monitor-id that can be used to
cancel the monitor.
The value will be positive normally; -1 if the group was unknown.
This facility is implemented using {\tt pg\_getview} to get the
current membership of the group and {\tt proc\_monitor} to watch
members for failure. A consequence is that
{\tt pg\_detect\_failure} will not detect situations in which all members of
a group leave it (using {\tt pg\_leave}) or the group is deleted (using {\tt pg\_delete}).
\begin{func}{Cancel a process group monitor}
\begin{verbatim}
pg_monitor_cancel (pg_monitor_id)
int pg_monitor_id;
\end{verbatim}
\end{func}
\begin{defs}
\defn{pg\_monitor\_id} Monitor-id of monitor to cancel.
\end{defs}
\index{process groups, cancel monitor}\index{\tt pg\_monitor\_cancel}
This function returns $0$ in the normal case, and
error code $-1$ if given a meaningless monitor-id or if the
monitor has already been canceled.
\begin{func}{Watch for a given site to fail or recover}
\begin{verbatim}
int
sv_watch (sid, event, routine, arg)
site_id sid;
int event;
int (*routine)();
int arg;
\end{verbatim}
\end{func}
\index{site views, watching individual members}\index{\tt sv\_watch}
\begin{defs}
\defn{sid} Site-id of site to watch.
\defn{event} One of {\tt W\_FAIL} or {\tt W\_RECOVER}.
\defn{routine} Routine to be called if the requested event occurs.
\defn{arg} Argument to be passed to {\tt routine}.
\end{defs}
\index{\tt W\_FAIL}\index{\tt W\_RECOVER}\index{\tt NULLSID arg. to {\tt sv\_watch}}
If {\tt W\_FAIL} is specified and the site fails, or
if {\tt W\_RECOVER} is specified and the site recovers,
a task is started to execute the function call
\newline\hspace*{2em}\verb|routine (sid, event, arg);|\newline
If {\tt sv\_watch} is called with {\tt NULLSID} as the value for {\tt sid},
{\tt routine} will be called if {\em any} site fails or recovers.
In this case, the second parameter to {\tt routine} will be the
{\tt site\_id} of the site that actually crashed or recovered.
{\em Warning:} When {\tt sv\_watch} is called with {\tt NULLSID}, and
several failure or recovery events are noted at once, the
watch routine will be invoked only once, with the arguments corresponding
to one of the events.
Hence, if this feature is used, it is recommended that you look at the
site view structure to determine the full list of events, and not rely on
the arguments to the watch routine.
{\tt sv\_watch} returns an integer watch-id that can be used to cancel
the watch.
If the requested action is meaningless (for example, a request to
watch for a site failure of a site that isn't operational, or to watch for
a recovery of a site that is already operational), the routine will return $0$.
The incarnation number of the parameter {\tt sid} is used in the case of
{\tt W\_FAIL},
but is ignored in the case of {\tt W\_RECOVER}.
If the specified site is unknown, $-1$ is returned.
\begin{func}{Cancel a site view watch}
\begin{verbatim}
sv_watch_cancel (sv_watch_id)
int sv_watch_id;
\end{verbatim}
\end{func}
\index{site views, cancel watch request}\index{\tt sv\_watch\_cancel}
\begin{defs}
\defn{sv\_watch\_id} Watch-id of watch to cancel.
\end{defs}
This function returns $0$ in the normal case, and returns error code
$-1$ if given a meaningless watch-id or if the
watch has already been canceled.
\begin{func}{Watch for a process to join or leave a named group}
\begin{verbatim}
int
pg_watch (gaddr_p, paddr_p, event, routine, arg)
address *gaddr_p;
address *paddr_p;
int event;
int (*routine)();
int arg;
\end{verbatim}
\end{func}
\index{process groups, watch member}\index{\tt pg\_watch}
\begin{defs}
\defn{gaddr\_p} Address of process group to watch.
\defn{paddr\_p} Address of process to watch.
\defn{event} One of {\tt W\_JOIN} or {\tt W\_LEAVE}.
\defn{routine} Routine to be called if the requested event occurs.
\defn{arg} Argument to be passed to {\tt routine}.
\end{defs}
\index{\tt W\_JOIN}\index{\tt W\_LEAVE}\index{\tt NULLADDRESS arg. to {\tt pg\_watch}}
If {\tt W\_JOIN} is specified and process {\tt paddr\_p} joins the group, or
if {\tt W\_LEAVE} is specified and process {\tt paddr\_p} leaves the group,
a task is started to execute the function call
\newline\hspace*{2em}\verb|routine (gaddr_p, paddr_p, event, arg);|\newline
If {\tt pg\_watch} is called with {\tt NULLADDRESS} as the value for {\tt
paddr\_p}, {\tt routine} will be called if {\em any} process joins or
leaves the group.
In this case, the second parameter to {\tt routine} is
the address of the process that actually joined or left the group.
Unlike site views, process group views
change by only one event at a time.
This feature can therefore be used without the risk of missing events.
{\tt pg\_watch} returns an integer watch-id that can be used to cancel
the watch.
As for {\tt sv\_watch}, $-1$ is returned if the group is
unknown, $0$ is returned if the process in
question is not currently a member or client of the specified group and
{\tt W\_LEAVE} was indicated, or if the process is already a
member and {\tt W\_JOIN} was indicated.
In the case where {\tt W\_JOIN} is specified and the process in
question fails before joining, the routine will be called, but the
event indicated will be {\tt W\_FAIL} instead of {\tt W\_JOIN}.
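For example, a process that has looked up a group might wait for a particular
worker process to join it, treating failure before joining as an error.
The routine names below are illustrative only.
\begin{verbatim}
#include <stdio.h>
#include "isis.h"

/* Called when the watched process joins the group, or fails first */
worker_event(gaddr_p, paddr_p, event, arg)
    address *gaddr_p, *paddr_p;
    int event, arg;
{
    if (event == W_JOIN)
        printf("worker has joined the group\n");
    else if (event == W_FAIL)
        printf("worker failed before joining\n");
}

int
watch_for_worker(gaddr_p, worker_p)
    address *gaddr_p, *worker_p;
{
    int wid = pg_watch(gaddr_p, worker_p, W_JOIN, worker_event, 0);

    if (wid == -1)
        printf("group unknown\n");
    return wid;                 /* pass to pg_watch_cancel if no longer needed */
}
\end{verbatim}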
\begin{func}{Cancel a process group watch}
\begin{verbatim}
pg_watch_cancel (pg_watch_id)
int pg_watch_id;
\end{verbatim}
\end{func}
\index{process groups, cancel watch}\index{\tt pg\_watch\_cancel}
\begin{defs}
\defn{pg\_watch\_id} Watch-id of watch to cancel.
\end{defs}
This function returns $0$ in the normal case, and
error code $-1$ if given a meaningless watch-id or if the
watch has already been canceled.
\begin{func}{Watch for a process to terminate}
\begin{verbatim}
int
proc_watch (paddr_p, routine, arg)
address *paddr_p;
int (*routine)();
int arg;
\end{verbatim}
\end{func}
\index{process, watching non group member}
\index{\tt proc\_watch}\index{process groups, watching a non-member}
\begin{defs}
\defn{paddr\_p} Address of process to watch.
\defn{routine} Routine to be called if the process terminates.
\defn{arg} Argument to be passed to {\tt routine}.
\end{defs}
If the named process terminates,
a task is started to execute the function call
\newline\hspace*{2em}\verb|routine (paddr_p, W_FAIL, arg);|\newline
{\tt proc\_watch} returns an integer watch-id that can be used to cancel
the watch (see below).
The return value will be $-1$ if the specified process
does not exist or has already failed.
\begin{func}{Cancel a process watch}
\begin{verbatim}
proc_watch_cancel (proc_watch_id)
int proc_watch_id;
\end{verbatim}
\end{func}
\index{process, cancel watch}\index{\tt proc\_watch\_cancel}
\begin{defs}
\defn{proc\_watch\_id} Watch-id of watch to cancel.
\end{defs}
This function returns $0$ in the normal case,
$-1$ if given a meaningless watch-id or if the
watch has already been canceled.
\subsection*{Examples}
\exam\label{ex:put1}
This code creates a new message, copies the values of a {\tt long} and a
{\tt short} integer into it, and later deletes the message.
\begin{verbatim}
long x;
short y;
message *msg_p;
...
msg_p = msg_newmsg();
msg_put (msg_p, "%d %h", x, y);
...
msg_delete (msg_p);
\end{verbatim}
\index{\tt msg\_put}\index{\tt msg\_delete}\index{\tt msg\_newmsg}
\exam\label{ex:put2}
This example copies the first five elements of an array of addresses
and an integer value into a message.
\begin{verbatim}
address alist[10];
int number;
message *msg_p;
...
msg_put (msg_p, "%A%d", alist, 5, number);
\end{verbatim}
Notice that the array format item {\tt \%A} corresponds to two arguments, the
\index{array format items}
address of the first element to copy ({\tt alist}) and the number of
elements to copy ({\tt 5}).
What would have happened if we had used {\tt \&alist[2]} instead of {\tt
alist} in the call to {\tt msg\_put}?
Could {\tt \%A[5]} have been used in this example?
\exam\label{ex:put2+}
This example stores an indirect reference to a buffer containing
1024 integers into a message. The routine {\tt buf\_free} will be
called when the message is deleted by ISIS, signalling that the buffer can be
reused for other data.
\begin{verbatim}
int buffer[1024];
message *msg_p;
...
msg_put (msg_p, "%*D[1024]", buffer, buf_free);
\end{verbatim}
Because the size of the array was fixed at 1024, the size
argument has been omitted.
What might happen if the contents of buffer were changed before
{\tt buf\_free(buffer)} has been called by ISIS?
\exam
The data put into the message in Example~\ref{ex:put1} can be read out as
follows.
\begin{verbatim}
long a;
short b;
message *msg_p;
...
msg_get (msg_p, "%d This is the short one -> %h", &a, &b);
\end{verbatim}
\exam
The array elements added in Example~\ref{ex:put2} can be read in two
ways.
If the maximum size is known (say, 15), we could read it out as follows.
\begin{verbatim}
address addrlist[15];
int n_elem;
message *msg_p;
...
msg_get (msg_p, "%A", addrlist, &n_elem);
\end{verbatim}
After this call, the addresses in the message will be stored in the first
five elements of the array {\tt addrlist}, and {\tt n\_elem} will have the
value 5.
If the maximum size is not known, the following alternative may be used.
\begin{verbatim}
address *addrlist_p;
int n_elem;
message *msg_p;
...
msg_get (msg_p, "%+A", &addrlist_p, &n_elem);
...
free ((char *) addrlist_p);
\end{verbatim}
Either alternative can be followed by
\begin{verbatim}
int num;
...
msg_get (msg_p, "%d", &num);
\end{verbatim}
This will store the value of {\tt number} in Example~\ref{ex:put2}
into the variable {\tt num}.
\exam
The following example creates two messages, copies some data into each of
them, then puts a copy of the first message into the second.
Later, it copies all the data out again.
\begin{verbatim}
int a1, a2, b1, b2, n_elem;
char c1, c2, d1[10], d2[10];
message *m1, *m2, *m3;
...
m1 = msg_newmsg();
m2 = msg_newmsg();
...
msg_put (m1, "%d%c", a1, c1);
msg_put (m2, "%d %C", b1, d1, 10);
msg_put (m2, "%m", m1);
...
msg_get (m2, "%d %C %m", &b2, &d2, &n_elem, &m3);
msg_get (m3, "%d %c", &a2, &c2);
...
msg_delete (m1);
msg_delete (m2);
msg_delete (m3);
\end{verbatim}
At the end of this code, {\tt a2}, {\tt b2}, {\tt c2}, and {\tt d2} will
have the values that were in {\tt a1}, {\tt b1}, {\tt c1}, and {\tt d1}
respectively.
Notice that {\tt m1}, {\tt m2}, {\em and the message {\tt m3} obtained from
{\tt msg\_get}} must be deleted explicitly.
Be sure that you follow the rules given under {\tt msg\_delete} if
you plan to use this format item.
Failing to do so will cause memory ``leaks'' whereby your
program slowly gets larger and larger as it runs (or perhaps
not so slowly!).
In rare situations, it may be useful to run multiple instances of ISIS
on a single machine, thus simulating a network of multiple computers
using one physical machine.
\index{simulating a multi-computer network}
To do this, you must use a special {\tt sites} file. Say that your
machine is named {\tt fafnir.cs.cornell.edu} and that you wish to simulate
4 machines.
The sites file should list {\tt fafnir} 4 times, as follows:
\begin{verbatim}
+ 1:1500,1501,1502 fafnir.cs.cornell.edu/1 csdept sun4
+ 2:1600,1601,1602 fafnir.cs.cornell.edu/2 csdept sun4
+ 3:1700,1701,1702 fafnir.cs.cornell.edu/3 csdept sun4
+ 4:1800,1801,1802 fafnir.cs.cornell.edu/4 csdept sun4
\end{verbatim}
Notice that the different copies of the fafnir entry {\em all have
different port numbers}. This is required.
There are two ways to start ISIS on this pseudo-network of 4 machines.
One is to run {\tt isis} in the normal manner, with no special arguments.
It will start 4 instances of itself, one for each fafnir entry in the
{\tt sites} file, delaying a few seconds after the first one is
started to avoid a confusing start sequence.
You can also start one copy at a time using the option {\tt -Hn} to
the {\tt isis} command. \index{{\tt isis}, {\tt -Hn} option}.
This starts a version of {\tt isis} that runs ``only on host n''.
It will check for other active copies of {\tt isis} on the other possible
host-id's and do a total or partial restart depending on what it finds.
Starting {\tt isis} this way, perhaps in 4 console windows, has
a big advantage: it lets you experiment with killing off some of the
pseudo-sites and restarting them while your code is running.
\label{Ch:news}
The news service\index{news facility}\index{bulletin board facility} is a facility similar to a bulletin board.
It allows an ISIS application to {\em post} a message
that may be read by any process that {\em subscribes} to the news service.
\subsection{How to post and receive news messages}
To post a message, a process calls {\tt news\_post}:\index{\tt news\_post}
\begin{verbatim}
news_post(sitelist, subject, msg_p, back)
site_id sitelist[];
char *subject;
message *msg_p;
int back;
\end{verbatim}
{\tt Sitelist} is a {\sc null} terminated list of sites.
The posted message will be distributed to all sites mentioned
in this list.
If a null pointer is given instead, the message will be forwarded to
all operational sites.
Every news message is posted under a certain {\em subject}.
A subject is an arbitrary string of up to {\tt SUBJLEN} characters.
For every subject the news service at each site
keeps a list of recently posted messages which new subscribers may look at.
When a message is posted, the parameter {\tt back} determines how long
the message will be retained as a ``back issue''.
If {\tt back = 0}, the message will be forwarded only to current subscribers
and will be deleted immediately afterwards.
If back is greater than zero, for example back = 5, the message will
be held until five new messages have been posted to the same subject.
A process that wishes to receive news messages calls\index{\tt news\_subscribe}
\begin{verbatim}
news_subscribe(subject, entry, 0)
char *subject;
int entry;
\end{verbatim}
The subscriber will then receive a copy of all new messages posted
under the given subject.
The news service will send these messages to the ISIS entry point
number specified in the parameter {\tt entry}.
A new subscriber may also be interested in looking at earlier posted
messages that are retained as back issues.
In this case {\tt news\_subscribe} should be called with a nonzero value
for the last parameter.
The general interface looks like this:
\begin{verbatim}
n = news_subscribe(subject, entry, nback);
int n;
char *subject;
int entry, nback;
\end{verbatim}
The parameter {\tt nback} specifies how many back issues the subscriber
wants to receive.
{\tt News\_subscribe} returns the number of back issues
that will be sent to the subscriber;
{\tt n} may be less than {\tt nback} if there are not as many
back issues available as requested.
When a message is posted, the news service automatically adds two
fields to the message containing the parameters
{\tt subject} and {\tt back} from the call to {\tt news\_post}.
A subscriber may inspect these fields by calling {\tt msg\_getfld()}
with {\tt FLD\_SUBJ} and {\tt FLD\_BACK} as field numbers.
Messages kept as back issues on a certain subject may be deleted explicitly
by using one of the following routines:\index{\tt news\_clear}
\begin{verbatim}
news_clear(slist, subject)
news_clear_all(slist, subject)
site_id slist[];
char *subject;
\end{verbatim}
{\tt News\_clear} deletes all messages posted by the caller, whereas
{\tt news\_clear\_all} deletes all messages posted by anybody.
A subscription may be canceled by calling
\begin{verbatim}
news_cancel(subject);
char *subject;
\end{verbatim}
{\tt News\_post} uses CBCAST to broadcast the message to the news services
at other sites.
If it is important that all subscribers receive news messages
in the same order, the following call should be used:\index{\tt news\_posta}
\begin{verbatim}
news_posta(sitelist, subject, msg_p, back)
site_id sitelist[];
char *subject;
message *msg_p;
int back;
\end{verbatim}
{\tt News\_posta} behaves exactly like {\tt news\_post}
except that it uses an ABCAST to post the message.
\subsection{Example: a load monitoring program}
In the following example a process posts a message about its system load \index{load monitoring program based on news service}
statistics under the subject ``Load Stat'':
\begin{verbatim}
long current_load;
message *msg_p;
current_load = get_current_system_load();
msg_p = msg_newmsg();
msg_put(msg_p, "%s %l", my_host, current_load);
news_posta(0, "Load Stat", msg_p, 1);
\end{verbatim}
The posted message will be saved by the news service until one other
message is posted to the subject ``Load Stat'' ({\tt back = 1}).
Some other process might be interested in monitoring load statistics
and printing a message whenever the load at any site exceeds a
certain value:
\begin{verbatim}
#define LOAD_MON_ENTRY 13
isis_entry(LOAD_MON_ENTRY, load_monitor, "load_monitor");
...
n = news_subscribe("Load Stat", LOAD_MON_ENTRY, 3);
...
load_monitor(msg_p)
message *msg_p;
{
char *subject, *host;
long load;
msg_getfld(msg_p, FLD_SUBJ, 0, "%-s", &subject);
msg_get(msg_p, "%-s %l", &host, &load);
if (load > 5) {
printf("==> %s <==\n", subject);
printf("Load at host %s is %ld\n", host, load);
}
}
\end{verbatim}
Immediately after subscribing, the process will receive up to
3 messages held as back issues ({\tt nback = 3}).
\subsection{Diagnostics}
All routines return a nonzero value in the case of an error,
except for {\tt news\_subscribe}, which indicates an error by returning
a negative value.
{\tt News\_post} and {\tt news\_posta} do not wait for replies when
broadcasting a message.
Therefore a successful return does not yet guarantee that the message has
been delivered to remote sites.
Furthermore,
a site crash wipes out all back issues held by the local news service.
The news service does not save messages on stable storage, nor does it
attempt to get back issues from some other site after a recovery.
Messages that were posted while the site was down are not saved
and will never reach that site even when it recovers.
The news service should be viewed as an asynchronous, light-weight
mechanism for making information available to other processes.
If your application requires more consistency or stability
then the process group mechanism should be used: instead of
subscribing to a news subject and posting messages, a process should
join a process group and broadcast to the group.
\subsection*{ISIS task package under other task subsystems}
\index{main task}
\label{Sec:threads}
ISIS uses an unusual task package that either implements lightweight tasks
(on machines that lack them) or presents a standard lightweight task interface (on machines
that already offer a mechanism for this purpose).
By doing this, ISIS permits the user to take advantage of
lightweight task packages such as the ones
provided by MACH (cthreads), SUN OS4.0 (LWP), Apollo, and LISP implementations
like Allegro, while
retaining a high degree of portability.
Moreover, sophisticated programmers can mix features of a native
task mechanism (such as task pre-emption, concurrent task execution, and
scheduling priorities) with the ISIS scheme.
\subsubsection*{ISIS from a task perspective}
Your ISIS application can be understood to consist of a collection
of lightweight tasks, each having a stack of its own.
Historically, UNIX supports a single stack, which can grow without limit
and is the stack used when your application starts execution.
This is the stack that the procedure called {\tt main} runs on,
and it follows that anything {\tt main} invokes runs without
risk of special stack size limitations.
Tasks are created dynamically by ISIS when messages are received,
when you invoke {\tt t\_fork} explicitly, when you invoke an ISIS
function that itself needs multiple tasks to complete (for example, {\tt bcast }
with the {\tt f} option), when a monitor or watch routine is triggered, etc.
These tasks are all subject to a stack size limitation because they run on
dynamically allocated stacks -- basically, chunks of memory obtained
from {\tt malloc}, and hence {\em not} in the normal stack area for
a UNIX program.
This can lead to some confusion. One source of problems is that UNIX
may incorrectly implement {\tt vfork}, simply because many UNIX kernels
do not understand that a non-standard stack is in use.
Additionally, the debuggers may have trouble giving stack traces on a
program that uses lightweight stacks (although fixed versions are
generally available on systems where lightweight tasks are implemented by
the vendor).
Finally, because almost no UNIX system allows a heavyweight process to execute
while a system call is in progress, be aware that doing any system call at all
will cause the entire process to block, and that doing a blocking IO call (e.g.
a read from the terminal) will block all the tasks in your whole program for as
long as it takes to complete the read. Messages from outside the program into it
would just queue up while this is happening, until finally a very large
backlog develops and ISIS kills the process in question by sending it a
``channel overflow'' signal.
This problem is avoided using {\tt non-blocking} IO calls or {\tt isis\_input} and
{\tt isis\_input\_sig}.
For the most part, despite these limitations, the use of lightweight tasks is transparent --
provided that stack overflows do not occur.
If a stack overflow does occur, data structures can easily be smashed, leading to
any of a variety of bizarre crashes. We can't help much on this -- you just need to
think about it and perhaps experiment with a larger stack size to see if it helps.
\subsubsection*{Use of tasks in ISIS}
ISIS has evolved from a severely restrictive approach to tasks,
which discouraged the use of the main
system stack and forced non-preemptive tasks on our
users, to a much more flexible approach.
We now encourage users to call ISIS routines
directly from the main stack, treating ISIS much like any other
subroutine package or library.
In this approach, the only routines run as lightweight tasks are the
ones that ISIS (or your code) explicitly forks off during execution.
The big advantage is that your ``main'' program is exempted from stack size
limits and looks rather normal.
The disadvantage is that the main program must still avoid blocking IO
calls that might leave the application ``hung'' from the perspective of ISIS,
and must periodically call {\tt isis\_accept\_events} to permit ISIS to
schedule other tasks and receive messages.
Users can also make use of task preemption in a limited way, as outlined below.
\subsubsection*{Callback style of message delivery}
Lightweight task users sometimes question the decision to have
{\tt isis\_entry} spin off a new task for each message
received. We do this for many reasons: it makes it easy to explain
what virtual synchrony means to users, it avoids risk of deadlock,
and it is certainly an easy way to deal with task creation.
\index{\tt msg\_rcv} \index{\tt msg\_ready} \index{\tt MSG\_ENQUEUE}
\index{\tt isis\_entry}
But, ISIS does allow users to receive messages using a synchronous
receive primitive. In {\tt isis\_entry}, if
the function is specified as {\tt MSG\_ENQUEUE}, messages to the
corresponding entry point are queued up and must be individually
received using a call {\tt mp = msg\_rcv(entry)}.
The calling task will block until a message is available; a
call to {\tt msg\_ready(entry)} will return the number of messages
available on the entry or 0 if there are none.
We generally recommend that users program using a call-back style
in which task creation occurs on each message. However,
LISP users and users with special performance requirements may need
to consider {\tt msg\_rcv} because of its much lower overhead.
When using {\tt msg\_rcv} the normal virtual synchrony guarantees only
apply as long as there is always a task waiting for the next message.
If your application blocks for even brief periods with no task
waiting on the entry point, any message received at this point could
be seen in different process-group views by different recipients.
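A sketch of this style follows. The entry number and the
{\tt handle\_request} routine are application-specific inventions, and the
comment about deleting the message reflects an assumption that messages
obtained from {\tt msg\_rcv} are owned by the caller.
\begin{verbatim}
#include "isis.h"

#define REQ_ENTRY 17            /* application-chosen entry point number */

serve_requests()
{
    message *mp;

    /* Queue messages on this entry instead of forking a task per message */
    isis_entry(REQ_ENTRY, MSG_ENQUEUE, "requests");

    for (;;)
    {
        /* Keep a task blocked here whenever possible, so that the
           virtual synchrony guarantees described above still apply */
        mp = msg_rcv(REQ_ENTRY);
        handle_request(mp);     /* application routine, not part of ISIS */
        msg_delete(mp);         /* assumed: caller owns messages from msg_rcv */
    }
}
\end{verbatim}
A call to {\tt msg\_ready(REQ\_ENTRY)} could be used instead to poll for
pending messages without blocking.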
\subsubsection*{ISIS and native task mechanisms}
Under systems that already support lightweight tasks,
ISIS must normally be compiled and linked with the
appropriate library (see the ISIS make files to
determine the appropriate option, if this is not the default,
as indicated in the makefile).
For example, SUN programs that run under ISIS may be linked either
with an ISIS tasks package or the one from SUN. By default, we
use our own, but if you compile the ISIS client library with flag -DSUNLWP
and specify the flag {\tt -llwp} to the loader, as illustrated
in the SUN3 system makefile,
ISIS will automatically map calls to {\tt t\_fork}, etc, into
calls to the corresponding routines from the SUN lwp package.
Moreover, when ISIS is called from a task created using the
task creation mechanisms these packages employ, ISIS will create
the necessary internal data structures to ``adopt'' that task as one of
its own.
Note that the
use of LWP is not the default on the SUN.
Most lightweight packages permit some degree of true parallelism
or preemptive scheduling. The rule is that when inside ISIS, or inside
a task run by ISIS, the mutual exclusion lock called {\tt isis\_mutex}
\index{Nonstandard task packages}\index{\tt isis\_mutex} will be locked.
This means that ISIS tasks can never preempt one-another, but that
other non-ISIS tasks can run concurrently with an ISIS task, e.g.
for mouse tracking or some other function unrelated to ISIS.
If such a task tries to enter ISIS, however, it will be forced
to block until it can acquire {\tt isis\_mutex}, be added to the ISIS run queue,
and finally scheduled in the ISIS first-in first-out manner.
For example, say that pushing a button on an X-windows interface
starts a task pre-emptively, but that this task calls a routine that
needs access to ISIS.
Such a call is permitted in the current scheme.
{\em However,} because entry to ISIS involves acquisition of a lock
(mutex) variable, the task might delay when trying to do this call.
The effect might be a bit surprising: the X-windows ``button'' would stay
active while some set of messages were received and unrelated ISIS actions
performed, and only when ISIS next gets a chance would this entry be
scheduled and the button routine permitted to resume execution.
Earlier in this section, we warned about limits on the stack size when running
as a lightweight task under ISIS.
In fact, these limits depend on the package you are using and the
way that a specific task came into existence.
The general rule is that when you specify a stack size to ISIS, it will
try to use this information for stack creation.
Under {\em cthreads}, which allows very large stacks,
the ISIS stack limit is simply not applied, and the
task stacks are grown using the standard MACH VM method. This is
somewhat slower than what ISIS normally does, but eliminates all risk of
stack overflow.
Under {\em SUN LWP} ISIS tasks are always created at priority 0 and
are allocated fixed size stack segments of the size given
in {\tt isis\_set\_stacksize}
or {\tt isis\_entry\_stacksize()}.
We considered calling {\tt stk\_newstk} but decided that since you can
do this external to ISIS and then have the task you created call into ISIS,
there wasn't much reason for us to do this automatically.
{\em Warning to Apollo users:} ISIS tasks all enable asynchronous interrupts
(otherwise, ISIS programs would be not just fault-tolerant, but unkillable!).
Since Apollo lacks condition variables, ISIS uses mutex
variables for this purpose. In principle this is completely
safe for the type of thing that we are doing.
Macros can be found in the ISIS include file {\tt cl\_task.h} (automatically
included with {\tt isis.h}) mapping the various task mechanisms into a single
uniform abstraction.
\subsection*{Porting old code to run under a task subsystem}
As noted earlier in this manual,
there are several acceptable startup sequences for ISIS.
The most important of these is for programs in which calling {\tt isis\_mainloop}
poses a problem.
A typical example would arise in the case of porting ``old code''
to run under ISIS.
Here, one has the problem that the existing program is unlikely to be task
structured.
Squeezing some unknown block of code into ISIS raises questions of stack size,
and hence can be a difficult undertaking, and the {\tt t\_on\_sys\_stack}
mechanism is awkward at best.
The alternative we recommend in this case is to arrange for a start
sequence that calls {\tt isis\_init()} directly,
does whatever group joins are desired in line (without ever calling
{\tt isis\_mainloop}!), calls {\tt isis\_start\_done()},
and then enters a large loop running the old code
and periodically calling {\tt isis\_accept\_events(flag)} to give
ISIS a chance to run.
A particularly convenient old program would be one that reads records from a
file and processes them one at a time.
In such a program, the ``file'' can be simulated using an internal
buffer that is loaded by ISIS tasks on receipt of messages.
The existing program would loop, reading a record from the buffer and
processing it, then calling {\tt isis\_accept\_events(ISIS\_BLOCK)}
repeatedly until the buffer has more work in it.
An alternative uses a select-style timeout structure: here you
specify {\tt ISIS\_TIMEOUT} followed by an additional argument
giving the timeout interval as a pointer to a {\tt struct timeval}
(see SELECT(3)). \index{\tt ISIS\_ASYNC}\index{\tt ISIS\_TIMEOUT}
\index{\tt ISIS\_BLOCK}
Notice that in this approach, the ISIS limit on stack size still applies
to any tasks ISIS starts up (say, on message receipt), but
does {\em not} apply to the old code.
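A skeleton of this startup sequence is shown below. The argument to
{\tt isis\_init}, the meaning attached to {\tt ISIS\_ASYNC} (``poll without
blocking''), and the routine {\tt old\_code\_step} (which stands for one unit
of the existing program's work) are assumptions made for illustration;
consult the startup chapter for the exact conventions on your system.
\begin{verbatim}
#include "isis.h"

main()
{
    isis_init(0);               /* assumed: 0 selects the default client port */
    /* ... perform any group joins in line, without isis_mainloop ... */
    isis_start_done();

    for (;;)
    {
        old_code_step();                /* one unit of the old program's work */
        isis_accept_events(ISIS_ASYNC); /* let ISIS run tasks and receive
                                           messages; ISIS_BLOCK or ISIS_TIMEOUT
                                           could be used instead, as above */
    }
}
\end{verbatim}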
ISIS uses a number of mechanisms to cope with very heavy load.
\index{load control in ISIS}
\index{overload and congestion}
They sometimes work, but are less than fully reliable in this version of the system.
The first overload mechanism is to choke off clients, so as to reduce the
number of new requests coming into the system.
ISIS does this by signaling ``congestion'' in the main system log, then
telling clients to stop initiating new broadcasts.
\index{congestion (overload)}
Clients are still permitted to receive messages and send replies, but
attempts to initiate new broadcasts will block.
When the load drops back to a reasonable level, considerably below the threshold
for signaling congestion, ISIS prints that it has become decongested and
permits clients to resume broadcasting.
A number of factors enter into the decision that ISIS is congested: the
number of bytes of message memory in use, the number of bytes of normal memory in use, the
number of active system tasks, and the number of bytes of data in the inter-site
layer of the system.
A second load mechanism is used when a client, either because it is
overwhelmed with work or in an infinite loop, ceases to read messages
from ISIS.
When the backlog of messages gets sufficiently large, ISIS will
discard new messages and send a signal, called SIGOVERFLOW in the ISIS system.
SIGOVERFLOW is usually defined on top of some normally unused signal.
\index{{\tt SIGOVERFLOW} sent during congestion}
On most ISIS configurations, this will kill any program that does not explicitly
catch the signal.
There is currently no way to negotiate with ISIS over the threshold
value above which clients will be killed.
The current limit is 32k bytes of backlogged data, plus an additional 8k
bytes of data that can be in the channel between ISIS and a client
before ISIS notices that the client is unresponsive.
Moreover, even if congestion occurs, ISIS will not kill a client that
is continuing to read messages from the system.
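A client that must survive such an event can catch the signal explicitly.
The sketch below uses the standard UNIX {\tt signal} call and assumes that
{\tt SIGOVERFLOW} is defined by the ISIS include files in your configuration;
the routine names are hypothetical.
\begin{verbatim}
#include <signal.h>
#include "isis.h"

overflow_handler()
{
    /* ISIS has already discarded the backlogged messages; note the event
       and make sure the program resumes reading messages from ISIS soon */
}

install_overflow_handler()
{
    signal(SIGOVERFLOW, overflow_handler);
}
\end{verbatim}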
ISIS also has a limit on how many clients it will talk to simultaneously.
Currently, this limit is 128.

The protocols process also produces dump information that can be useful when
\index{protocols process dump}
interpreting a troublesome system state.
The dump has roughly the same format.
It consists of information about the memory usage for the protocols
process, which can become fairly substantial when a lot goes
on at one time,
information about currently active tasks (these are typically concerned with
actually performing a broadcast on behalf of some client),
site and process group views,
internal data structures used by the protocols,
information about the queues of messages associated with each client and
the processes (if any) that each client is monitoring
using {\tt proc\_watch},
and a collection of statistics about the inter-site communication
channels that are in use.
The protocols process produces a dump if a crash occurs.
In addition, it can be asked to make a dump (without shutting down)
using the {\tt cmd} tool or by sending it a {\tt SIGUSR2} signal.
\index{protocols process dump, creating using {\tt cmd} tool}
\index{protocols process dump, creating using {\tt kill} command}
\index{protocols process dump, creating due to ISIS system crash}
The dump is either added to the protocols process log file,
called {\tt {\it site-no}.log}, where {\it site-no} is the site-number,
\index{protocols process dump, name of log file}
or printed on the screen if the {\tt cmd} tool is given a {\tt dump}
command.
During normal execution, the protocols process puts few
messages in its log file.
Large log files or log files containing strange messages
are usually a sign of a problem.
Many such problems are easily resolved: an error in the {\tt sites}
configuration file, or an application that sends 10 replies to
a caller who asked for 1, will result in precise and fairly
easily understood error messages in this file.
Other problems, such as system bugs that cause a crash, will
often cause log file entries, but it is unlikely that a novice
could make much sense of them.
Protocols process dumps can be extremely large and complex.
It is certainly not a bad idea to make one if the ISIS system
seems to be in a hung state, or some other problem seems
to have arisen.
But, do not expect to make much sense out of it, or to diagnose
ISIS system bugs by examining these dumps.
They are highly detailed and technical, and considerable
expertise is needed to interpret them.
Here is a typical example, made using the {\tt dump} command in the command
tool when the site was idle:
\begin{verbatim}
PROTOCOLS PROCESS 8/0 INTERNAL DUMP REQUESTED: ISIS protocols process status dump
Memory mgt: 39396 allocs, 39290 frees 24632 bytes in use
Message counts: 1111 allocs 1107 frees (4 in use)
Message memory: 8436 memallocs, 8407 memfrees (3376 bytes in use)
tasks: scheduler 4b69c ctp cc0c8
TASK[cc0c8]: cl_wantdump(cbd74), ** running **
runqueue:
Site view 1: <8/0>
Process group views: root 6dd60
associative store: as_ndelete 0, as_nlocdelete 0
abq:
max_priority = d08
cbcast data structures:
pbufs:
pb_itemlist:
idlists:
piggylists:
gbcast data structures:
gb_locks:
wait queues:
glocks:
Failure detector: current view 1:
slist: 800
incarn: 8/0
failed: `0'
recovered: `000000001'
coord, no fork, no fail, no prop, no oprop, not sent_oprop
Pending failures:
Pending recoveries:
Replies wanted:
clients:
: idle
: idle
client pid=7181: idle
intersite:
Intersite:
Allocated messages:
\end{verbatim}
The structure is similar to that of a client dump, but includes information about
the data structures used by the protocols and the states of the various clients
at this site.
Here is a dump made when ISIS was a little more active.
\begin{verbatim}
PROTOCOLS PROCESS 7/0 INTERNAL DUMP REQUESTED: ISIS protocols process status dump
Memory mgt: 5247270 allocs, 5246829 frees 87960 bytes in use
Message counts: 398970 allocs 398962 frees (8 in use)
Message memory: 2431917 memallocs, 2431864 memfrees (6368 bytes in use)
tasks: scheduler 497ac ctp 90178
TASK[ac1c4]: cl_gbcast(65660), wait(backoffcond:afce8)
TASK[8c170]: cl_gbcast(5865c), wait(backoffcond:8fc94)
TASK[b41d4]: pg_failed(1b2a), wait(backoffcond:b7ca4)
TASK[90178]: cl_wantdump(586c4), ** running **
runqueue:
watch_queue[1]:
watch_queue[2]:
watch_queue[3]:
watch_queue[4]:
watch_queue[6]:
watch_queue[7]:
watch_queue[8]:
watch_queue[10]:
watch_queue[12]:
Site view 9: <2/0> <3/0> <1/0> <4/0> <6/0> <12/0> <7/0> <10/0> <8/0>
Process group views: root 55e48
Client(7/0:6954.0)
(gid 4/0.15[0])(* delsent *) =
VID 8, 8 members, 0 clients, membs=[(4/0:27299.0)(2/0:9158.0)(3/0:11428.0)(1/0:2658.0)(6/0:2968.0)(12/0:26852.0)(7/0:6954.0)(10/0:10398.0)]. clients=[]
associative store: as_ndelete 0, as_nlocdelete 0
abq:
max_priority = 5fb03
cbcast data structures:
pbufs:
pb_itemlist:
idlists:
piggylists:
gbcast data structures:
gb_locks:
wait queues:
glocks:
Failure detector: current view 9:
slist: 200 300 100 400 600 c00 700 a00 800
incarn: 1/0 2/0 3/0 4/0 6/0 7/0 8/0 10/0 12/0
failed: `0'
recovered: `000000001'
not coord, no fork, no fail, no prop, no oprop, not sent_oprop
Pending failures:
Pending recoveries:
Replies wanted:
clients:
: idle
: idle
: idle watched by sites < 8 10 >
: idle watched by sites < 4 >
client pid=7189: idle
intersite:
Intersite: I heard from `0111101110101', they heard from `0111101110101'
1/0 [sol.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os5,is113,as5
2/0 [shamesh.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os60,is123,as60
3/0 [iving.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os45,is94,as45
4/0 [bifrost.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os3,is64,as3
6/0 [aditi.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os96,is88,as96
8/0 [modi.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os40,is40,as40
10/0 [hymir.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os86,is67,as86
12/0 [embla.cs.cornell.edu]:
estab;alive;sl32,nb0,bk0,os29,is78,as29
\end{verbatim}
This particular dump shows several tasks, running the {\tt gbcast}
protocol on behalf of client process 6954 which is running at this site.
In another dump made a little later, these protocols will have terminated
and the corresponding tasks will have vanished.
A dump of a very active site can be enormous and very complex.
This is not a sign of trouble, unless something else seems to be malfunctioning.
If, however, something IS malfunctioning, the dump is the best way for
the ISIS project to diagnose the problem. So, if you suspect a
system bug, make dumps from all the active sites and send ALL the dumps
to us. Keep in mind that in a distributed system,
a problem at one site may manifest itself in misbehavior at some other site
entirely.
Fortunately, ISIS has been debugged carefully and we now encounter problems
very rarely. With dumps, we have been very successful in fixing those problems
that do come to our attention.
\section{Process groups and process lists}
Process groups\index{process groups}
form the unit upon which many ISIS operations are based, the
most common one being to broadcast a message to the members of a
group and obtain replies.
\marginpar{{\em Process-groups are the most important ISIS mechanism.}}
The process group mechanism also provides options to allow state
information to be transferred to a joining process, to enable
logging of data on disk, to automate the recovery of the group from
disk should all the members fail, and to authenticate the credentials of
processes attempting to join or access the group.
Some of the ISIS facilities that operate based on a process group are
\begin{description}
\item[Replicated data.]
This consists of mechanisms for
replicating data among the members of a group.
See Chapter 9 for details.
\item[Distributed execution.]
This consists of a collection of methods for distributing the
execution of a request over the members of a process group (Chapter 10).
For example, the {\em coordinator-cohort tool} provides support for distributed
executions where one member of a process group (the coordinator) carries out
a computation, while other members (the cohorts) monitor the coordinator and
take over to complete the computation if the coordinator fails.
\item[The token tool.]
Provides synchronization among the members of a process group.
It includes features to handle the failure of a process holding mutual
exclusion (Chapter 11).
\end{description}
Process groups are intended to be highly dynamic entities.
A process can create a new group, join an existing one, or leave a group at
any point in its computation.
Membership in a process group is cheap, and
at any given time a process may be a member of several groups.
Conversely, several different groups may have the same membership.
It is possible for a group of processes to temporarily form a group to make
use of one of the options of the process group mechanism or to invoke one
of the tools listed above, after which they may all drop out of the group.
This section describes the routines that deal with joining, leaving, and
obtaining information about process groups.
Groups have symbolic names
\index{process groups, symbolic names for},
and this raises the question of uniqueness.
Because ISIS is intended to be useful in large networks,
the namespace has been divided into naming ``scopes'' or domains.
A scope is just a set of sites, defined in the system ``sites'' file at
system startup time. Although sites can crash and reboot, the scope
remains static.
One site may be a member of more than one scope. All sites belong to the {\em global}
scope.
A group name must be unique within the scope in which it was created, and\index{scope, for namespace searches}
namespace searches are limited to the scope specified at the time
the operation was invoked.
The bigger the scope, the more costly these rules are for ISIS to enforce.
To obtain the best possible performance, namespace scopes should be small and
``accurate''. For example, on a factory floor, each manufacturing cell might
be given a namespace scope of its own.
Groups formed internal to the cell could thus have names confined to the
cell's namespace scope, and any costs associated with creating or searching
for those names would be confined to the sites that comprise the cell.
As a network gets large, controlling these costs becomes critical to
obtaining good performance.
The programmer specifies the scope within which ISIS should search
for a process group name by using a special notation.
A group name is either
given as a simple string, e.g. ``timeservice'' or ``/pmake/mygroup'',
or in the form ``@scope:name'', e.g. ``@building2:timeservice''.
The scope name and group name are each limited to 64-characters in length.
The global scope, which encompasses all sites, is specified by ``@*:name''.
If no scope is specified, the
search covers the scopes to which the site where your application is running belongs (as
defined in the {\tt sites} file; see ``Setting up ISIS'').
If no scopes are listed for that site in the sites file (which would certainly be an error),
the global scope is assumed.
To avoid
accidental name conflicts within a scope, we recommend that a
UNIX file-system style of naming groups be employed: ``/pmake/mygroup'' rather than ``mygroup'', or
``/nmrd/sigpro1'' and ``/nmrd/sigpro2'' rather than just ``sigpro1'' and
``sigpro2''.
In systems with multiple similar components (i.e. multiple {\tt sigpro}
processes) it will generally be a good idea to have a process group
to which all of these belong (i.e. ``/nmrd/sigpro'') and then to name
the component processes in the order that they are started up, by
counting the number of processes that have joined the parent group.
If ISIS cannot map the scope you give to a set of sites,
group create and lookup operations will fail.
The {\tt cmd} tool can be used to list the currently known scopes, to cause the sites file
to be {\tt rescanned} after a change is made (perhaps to add a new site or scope),
and to {\tt list} the process groups known in any specified scope.
This tool is described in section~\ref{sec:cmd} of the manual.
\subsection*{Process lists}
A {\em process-list} is a convenient way to communicate with a subset of the
members of a process group.
\index{process lists}
\marginpar{{\em Process-lists are a new feature for the advanced ISIS user.}}
A recent addition to ISIS, this facility is used privately by a member
of a process group that needs to communicate with some of the group
members but doesn't want to communicate with all of them at once, or to
pay the cost of forming a separate subgroup for each communication.
A list has an address somewhat like a process group address, but
the value is only meaningful within the process that created the list,
and the list cannot be given a symbolic name.
A process list is completely local to the process that created it, and
few of the ISIS tools can be used on them. However, the
facility is integrated into the
high speed ``bypass'' protocol suite mentioned in Chapter 2, and
hence can be used for fast multicasting.
Process lists will be covered in detail in Chapter 5.
\subsection*{Creating and joining process groups}
\begin{func}{Join a process group. Optionally monitor membership, transfer
state, etc.}
\begin{verbatim}
address *
pg_join (gname,
:
PG_KEYWORDi, argi1, argi2, ...,
:
0)
char *gname;
\end{verbatim}
\end{func}
\index{\tt pg\_join}\index{process groups, joining and creating}
\begin{defs}
\defn{gname} Null-terminated string giving the name of the group.
\end{defs}
If a group with the given name already exists in the specified scope, the process
calling {\tt pg\_join} is made a member of the group.
Otherwise, a new group is created with the calling process as its only
member unless the {\tt PG\_DONTCREATE} option is specified (see below).
The address of the group is returned.
If the group does not exist in the specified scope,
and the scope does not include the calling site, {\tt pg\_join} will fail
and return {\tt NULLADDRESS}.
{\em Note:} The new member can be at a site that is not included in the
scope of the search, but in this case if all the other group members
fail and only the new member remains operational, searches using the original scope
will not find the group.
Options to {\tt pg\_join} are placed between the first argument and the
last, which is always $0$.
Each option consists of a keyword followed by the arguments required
for that option.
Some of the keywords are listed below, with their arguments in square
brackets.
Other options are discussed in later sections.
\begin{defs}
\defn{PG\_DONTCREATE} [no arguments] Disallows group creation if a
group does not already exist.
\index{{\tt pg\_join}, {\tt PG\_DONTCREATE} option}
\defn{PG\_LOGGED} [{\tt fname, reply\_entry, flush\_type, end\_replay}]
\index{{\tt pg\_join}, {\tt PG\_LOGGED} option}
Tells the system that the state of this
process group should be backed up on a {\em log} file.
On recovery, if the process group is restarting after a total failure and the
group does not already exist, the log will be reread to recover the group
state.
Otherwise, if the group was already
operational, the log will be ignored and the join will make use of a state transfer
from an operational member.
Logging and the meaning of the arguments to {\tt PG\_LOGGED}
are discussed in much more detail in chapter 12.
\index{process groups, enabling logging during {\tt pg\_join}}
\defn{PG\_INIT} [{\tt routine} (pointer to a function)] Calls {\tt routine (gaddr\_p)}
\index{{\tt pg\_join}, {\tt PG\_INIT} option}
if this group is being created and no log for it could be found.
The {\tt routine} is
responsible for the first-time initialization of the group, which will have been assigned
address {\tt gaddr}.
{\em Warning:} the initialization routine will be invoked
before {\tt pg\_join} returns.
\defn{PG\_MONITOR} [{\tt routine} (pointer to a function), {\tt arg}
\index{{\tt pg\_join}, {\tt PG\_MONITOR} option}
(integer)] Sets up a monitor for group membership changes.
{\tt routine (gview\_p, arg)} is invoked as a task with the new group view and {\tt arg} as
parameters whenever the group membership changes (see {\tt pg\_monitor}
below).
{\em Warning:} the monitor routine will trigger for the first time
before the {\tt pg\_join} returns.
Notice that the group address is available as part of the {\tt groupview} data structure.
\defn{PG\_XFER} [{\tt domain} (integer), {\tt send\_routine}, {\tt
\index{{\tt pg\_join}, {\tt PG\_XFER} option}
receive\_routine} (pointers to functions)] Transfers state to
the joining process (see {\bf State transfer} below).
The send routine can be given as {\tt xfer\_refused},
\index{{\tt xfer\_refused}}
\index{{\tt pg\_join}, use of {\tt xfer\_refused}}
in which case state transfers will not be attempted from this process (this
routine can also be called from a state sending routine if transfer is
temporarily impossible).
\defn{PG\_BIGXFER} [no arguments] No longer supported.
\defn{PG\_CREDENTIALS} [{\tt credentials} (null-terminated string)] Defines
\index{{\tt pg\_join}, {\tt PG\_CREDENTIALS} option}
{\tt credentials} to be the credentials string for this process.
It is presented to an existing member of the group to obtain authentication
for this join request.
\defn{PG\_JOIN\_AUTHEN} [{\tt routine} (pointer to a function)] Prevents
\index{{\tt pg\_join}, {\tt PG\_JOIN\_AUTHEN} option}\index{protection}
processes from joining the group without presenting credentials.
\index{protection}\index{authentication during join}
When a process attempts to join,
{\tt routine} is called in one of the existing members
with the credentials string of the joining process passed
as its first parameter.
If {\tt routine} returns $-1$, the join is refused.
\defn{PG\_CLIENT\_AUTHEN} [{\tt routine} (pointer to a function)] Makes it possible to
\index{{\tt pg\_join}, {\tt PG\_CLIENT\_AUTHEN} option}
prevent
processes that are neither members nor clients from
communicating with this group (see {\tt pg\_client} below).
When a process attempts to become a client by calling {\tt pg\_client},
{\tt routine} is called in one of the existing members
with the credentials string of the process passed
as its first parameter.
If {\tt routine} returns $-1$, the {\tt pg\_client} request is refused.
\end{defs}
The sequence of events that occurs when a {\tt pg\_join} is invoked is as follows.
First, either the
initialization routine or the state transfer receive routines will be invoked,
depending on the current state of the group.
Next, the first monitor invocation will take place.
After this {\tt pg\_join} will return.
Only after all of these events will incoming requests to the group begin to
be processed.
Reception of incoming requests will be deferred even longer if the task
that called {\tt pg\_join} is the startup task.
In this case, the group will not receive incoming requests until that task
either terminates or calls {\tt isis\_start\_done}.
\index{process groups, sequence of events during {\tt pg\_join}}
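As an illustration, the call below joins (or creates) a group using several
of these options; it assumes {\tt isis.h} has been included.
The group name and the routines {\tt init\_group}, {\tt new\_view},
{\tt send\_state} and {\tt rcv\_state} are hypothetical, and the
{\tt PG\_XFER} domain is given as 0 purely for illustration.
\begin{verbatim}
int init_group(), new_view(), send_state(), rcv_state();
address *gaddr;

gaddr = pg_join("/demo/mygroup",
        PG_INIT, init_group,                 /* first-time initialization */
        PG_MONITOR, new_view, 0,             /* run on each membership change */
        PG_XFER, 0, send_state, rcv_state,   /* state transfer routines */
        0);
if (gaddr == (address *) 0 || addr_isnull(gaddr))
    printf("pg_join failed\n");
\end{verbatim}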
\begin{func}{Temporarily inhibit process-group joins}
\begin{verbatim}
pg_join_inhibit(state);
\end{verbatim}
\end{func}
\index{\tt pg\_join\_inhibit}
\begin{defs}
\defn{state} One of {\tt TRUE} or {\tt FALSE}.
\end{defs}
This facility is intended mostly for use in
coordinator-cohort computations (Chapter 8).
{\tt pg\_join\_inhibit} temporarily prevents new processes from joining
any process group that the process calling {\tt pg\_join\_inhibit} is a
member of.
The function inhibits joins as long as the
{\tt TRUE} requests outnumber {\tt FALSE} requests.
A process attempting to join a process group when joins are inhibited is
simply made to wait until they are not.
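For example, a member might bracket a computation that must run under a
fixed membership as follows (a sketch; the work in the middle is whatever
the application requires):
\begin{verbatim}
pg_join_inhibit(TRUE);     /* joins to all of our groups now wait */
/* ... computation that must complete before any new member is admitted ... */
pg_join_inhibit(FALSE);    /* the matching FALSE call re-enables joins */
\end{verbatim}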
\begin{func}{Inhibit new members from joining a process group while an entry is active}
\begin{verbatim}
isis_inhibit_joins(entry_number);
\end{verbatim}
\end{func}
\index{\tt isis\_inhibit\_joins}
{\tt isis\_inhibit\_joins} arranges for the system to call {\tt pg\_join\_inhibit}
when a message to the designated entry point is received, and to undo this
when the entry terminates.
\subsection*{Comments about possible deadlocks during pg\_join}
Users of {\tt pg\_join} should be aware that deadlocks can arise from
certain patterns of interactions between joining processes and the groups
that they join, especially when the call to {\tt pg\_join} is done during startup
(from the main task, before {\tt isis\_start\_done} has been called).
First, notice that if a member of a process group has inhibited joins
then a join request may be forced to wait, possibly indefinitely in the
case where the inhibit was due to a bug.
In such a case it would be necessary to obtain a {\em client dump} (see Chapter 14).
By interpreting the dumps from the process group members and the process attempting
to join, it is easy to determine what has happened.
There are also more insidious ways to cause deadlocks.
In particular, since {\tt pg\_join} involves a request by a potential new
member to the current members of a group, during the interval between when a new member
joins and when it first begins accepting requests, further attempts to join the
group will hang.
Thus, a deadlock can occur if two processes simultaneously attempt to join the same two
groups during their startup sequences, and do so in different orders.
Such a deadlock could be
eliminated by calling {\tt isis\_start\_done} before doing the second
{\tt pg\_join} in either process, or by issuing the two {\tt pg\_join} calls in the same
order in both processes.
\subsection*{Creating subgroups}
\begin{func}{Create a process group with prespecified initial membership}
\begin{verbatim}
address *pg_subgroup(pgaddr_p, sgname, incarn, members, clients)
char *sgname;
address members[], clients[];
\end{verbatim}
\end{func}
\index{\tt pg\_subgroup}\index{process groups, creating subgroups}
\begin{defs}
\defn{pgaddr\_p} Address of the parent group, caller must be a member.
\defn{sgname} Scope and name for the subgroup, as in {\tt pg\_join}.
\defn{incarn} Incarnation number, normally specified as 0.
\defn{members} Null-terminated list of initial members, subset of parent group membership.
\defn{clients} Null-terminated list of initial clients, unrestricted.
\end{defs}
Some applications dynamically create subgroups of existing groups and would
pay a high cost if subgroup members had to join one by one.
{\tt pg\_subgroup} is provided for this purpose. The group is
created with a prespecified membership and client list.
The members must be a subset of the members of a specified parent group.
Aside from the way in which it is created, a subgroup is a completely normal
process group and permits all the ISIS process group operations.
Subgroup members should manually call {\tt pg\_monitor} to monitor the subgroup
membership (see Section 4.6) and {\tt allow\_xfers\_xd} to enable state transfers
(see Section 4.7), if desired.
Often, it is helpful for the process that creates the group to send an initial broadcast
to the members causing them to take these actions.
A subgroup is a full-fledged group, unlike the process lists
mentioned above.
Subgroups have names, can be accessed by processes other than the
creator, and continue to exist independent of the group from which
they were formed.
All ISIS tools work identically on subgroups.
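A sketch of subgroup creation follows.
It assumes that the caller is a member of the parent group whose address is
{\tt pgaddr\_p}, that the parent currently has at least two members, and that
the subgroup name is purely illustrative.
\begin{verbatim}
groupview *gv;
address members[3], clients[1], *sg;

gv = pg_getview(pgaddr_p);          /* current view of the parent group */
members[0] = gv->gv_members[0];     /* the two oldest members */
members[1] = gv->gv_members[1];
members[2] = NULLADDRESS;           /* null-terminate the member list */
clients[0] = NULLADDRESS;           /* no initial clients */
sg = pg_subgroup(pgaddr_p, "/demo/sub", 0, members, clients);
\end{verbatim}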
\subsection*{Leaving a process group}
\begin{func}{Delete a process group}
\begin{verbatim}
int
pg_delete (gaddr_p)
address *gaddr_p;
\end{verbatim}
\end{func}
\index{\tt pg\_leave}\index{process groups, leaving}
\begin{defs}
\defn{gaddr\_p} Address of the group to delete; caller must be a member.
\end{defs}
{\tt pg\_delete} causes the specified group to be deleted. The
\index{\tt pg\_delete}\index{process groups, delete}
group membership and client counts drop to 0, triggering all applicable watch
and monitor routines, and the group then ceases to exist.
If recreated, it will receive a different group-address.
Subgroups of a deleted group are {\em not} deleted automatically when the
parent group is deleted.
From the instant that {\tt pg\_delete} is invoked, the application
ceases to receive messages to the specified group.
\begin{func}{Leave a process group}
\begin{verbatim}
pg_leave (gaddr_p)
address *gaddr_p;
\end{verbatim}
\end{func}
\begin{defs}
\defn{gaddr\_p} Address of a process group.
\end{defs}
The process calling {\tt pg\_leave} ceases to be a member of process group
{\tt gaddr\_p}.
When a process terminates (either normally or because of a failure), this
function is invoked automatically once for every group that the process was
a member of.
It may occur to you to wonder if
{\tt pg\_leave} has an {\em instantaneous} effect on the group, or
if there might be some delay before it takes effect during which additional
messages could still be delivered to the leaving process.
In fact, the answer is a mixture of these modes.
For the process leaving a group, the effect is instantaneous.
As soon as {\tt pg\_leave} is called---even before it returns---no
further messages will be delivered to this process because of its
prior membership in the group.
However, {\em other} processes may see the leave occur somewhat later,
and it might even seem that some messages were delivered to the
leaving process that it in fact did not receive.
ISIS did not always work this way; specifically, we used to impose a delay
during which a process might still get messages even after calling {\tt pg\_leave}.
This, however, proved to be a common source of bugs. The new mechanism
seems closer to what our users wanted, and all the
ISIS group-based tools work correctly under this rule.
\subsection*{Other functions for dealing with process groups}
\begin{func}{Get the address of a process group}
\begin{verbatim}
address
pg_lookup (gname)
char *gname;
\end{verbatim}
\end{func}
\index{\tt pg\_lookup}\index{process groups, determining address}
\begin{defs}
\defn{gname} Null-terminated string giving the scope in which to search and the
name of the group.
\end{defs}
{\tt pg\_lookup} returns the address of a process group, without causing
the calling process to join the group.
\begin{func}{Become a client of a process group}
\begin{verbatim}
pg_client (gaddr_p, credentials)
address *gaddr_p;
char *credentials;
\end{verbatim}
\end{func}
\index{\tt pg\_client}\index{process groups, becoming a client}\index{protection}
\begin{defs}
\defn{gaddr\_p} Address of a process group.
\defn{credentials} Null-terminated string used as credentials.
\end{defs}
\Marginpar{
\em Process groups can restrict operations to only
members or only clients.
}
If a process group wishes to restrict some of its operations to
be done only by members or clients, it should
specify {\tt PG\_CLIENT\_AUTHEN} in {\tt
pg\_join}.
When an operation is invoked, it should start by calling {\tt pg\_rank\_all} to
verify that the request came from a legitimate
client or member (see below).
Illegal requests can then be explicitly rejected either by
replying with an error indication or by sending a {\tt nullreply} or an {\tt abortreply}.
Notice that ISIS does not perform this check automatically.
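A sketch of such a check appears below.
The entry routine and the global {\tt gaddr} holding the group address are
hypothetical, and the exact form of the rejection (here a {\tt nullreply})
is up to the application.
\begin{verbatim}
do_update(mp)
    message *mp;
{
    if (pg_rank_all(gaddr, msg_getsender(mp)) == -1) {
        nullreply(mp);      /* sender is neither a member nor a client */
        return;
    }
    /* ... perform the operation and reply normally ... */
}
\end{verbatim}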
To become a client of a process group, a process calls {\tt pg\_client}
with a string describing its credentials.
The client authentication routine is invoked in one of the members, with
{\tt credentials} being passed as its first parameter.
If the routine returns $-1$, the process is not made a client.
This mechanism can be used to enforce security policies.
{\em Performance hint:}
Even if there is no authentication mechanism in place, it might be
desirable to make a process a client of a process group.
Information about a group is always stored locally at each of its members and
clients.
Thus clients use a local copy of this information when communicating with
the group, which leads to better performance than if this information had
to be obtained from another site.
So if a non-member process does a lot of communication with a process
group, it is advisable to make it a client of the group.
This is also true when BYPASS communication is used: clients of a process
group can communicate with it using the BYPASS protocols, whereas
non-clients may be forced to use the old, slower ISIS protocols.
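For example, a process that expects to communicate heavily with the group
might simply register as a client (the empty credentials string assumes no
client authentication routine is installed):
\begin{verbatim}
pg_client(gaddr_p, "");    /* cache the group view locally for fast access */
\end{verbatim}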
\begin{func}{Get information about a process group}
\begin{verbatim}
groupview *
pg_getview (gaddr_p)
address *gaddr_p;
\end{verbatim}
\end{func}
\index{\tt pg\_getview}\index{process groups, get group view}
\begin{defs}
\defn{gaddr\_p} The address of a process group (obtained by calling {\tt
pg\_join} or {\tt pg\_lookup}).
\end{defs}
This function returns a pointer to a group view structure for the process
group.
\index{{\tt groupview} data structure}
\Marginpar{{\em{\tt groupview} structure}}
The contents of this structure are as follows.
\begin{verbatim}
#define PG_GLEN 128
#define PG_ALEN 64
typedef struct
{
int gv_viewid; /* View number */
int gv_incarn; /* Incarnation number */
int gv_flag; /* Indicates if this is cached or not */
address gv_gaddr; /* Group address */
char gv_name[PG_GLEN]; /* Group name */
short gv_nmemb; /* Number of members */
short gv_nclient; /* Number of clients */
address gv_members[PG_ALEN]; /* List of members */
address gv_clients[PG_ALEN]; /* List of clients */
address gv_joined; /* New member, if any */
address gv_departed; /* Departed member, if any */
int gv_refcount; /* For internal use -- reference counter */
} groupview;
\end{verbatim}
The view number is incremented each time the group membership changes.
The incarnation number is used by the system
to keep track of recoveries when all the
members of the group fail.
{\tt gv\_joined} and {\tt gv\_departed} indicate the address of any process
that joined or departed from the group since the last membership
change; at most one of these will be non-null because only one group
membership change happens at a time in ISIS.
It is possible for both to be null in the case where the only change was to the
list of clients.
Both address lists are terminated by {\tt NULLADDRESS}.
The addresses in {\tt gv\_members} are ordered according to the rank of the
member (see below).
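The sketch below prints the current view of a group whose address
{\tt gaddr\_p} was obtained earlier (for example from {\tt pg\_join});
the output format is arbitrary.
\begin{verbatim}
groupview *gv;
address *memb;

gv = pg_getview(gaddr_p);
printf("view %d of %s: %d members, %d clients\n",
        gv->gv_viewid, gv->gv_name, gv->gv_nmemb, gv->gv_nclient);
for (memb = gv->gv_members; !addr_isnull(memb); memb++)
    printf("  member: site %d, process %d\n",
            memb->addr_site, memb->addr_process);
\end{verbatim}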
\begin{func}{Obtain the rank of a process in a process group, or
test for group membership}
\begin{verbatim}
int pg_rank (gaddr_p, paddr_p)
address *gaddr_p, *paddr_p;
\end{verbatim}
\end{func}
\index{\tt pg\_rank}\index{process groups, rank of member}\index{process groups, check membership}\index{protection}
\begin{defs}
\defn{gaddr\_p} Address of process group.
\defn{paddr\_p} Address of process.
\end{defs}
{\tt pg\_rank} returns the rank of process {\tt paddr} in process group
{\tt gaddr}, if it is a member of the group.
Each member of a process group has a unique rank, which changes only when
the group membership changes.
The oldest member has rank $0$, the second oldest is rank $1$,
and so on (ties are broken in the same way everywhere).
The rank may be used to give each member of a process group a unique
identifier.
If the process is not a member of the process group, {\tt pg\_rank} returns
$-1$.
It can hence be used to test for membership in a group.
{\em Warning: this routine is extremely slow if called from a non-member,
non-client of the group.}
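For example, assuming {\tt gaddr\_p} holds the group address, a process can
test its own membership as follows:
\begin{verbatim}
if (pg_rank(gaddr_p, &my_address) != -1) {
    /* this process is currently a member of the group */
}
\end{verbatim}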
\begin{func}{Obtain the rank of a process or client in a process group, or test
for group membership}
\begin{verbatim}
int pg_rank_all (gaddr_p, paddr_p)
address *gaddr_p, *paddr_p;
\end{verbatim}
\end{func}
\index{\tt pg\_rank\_all}\index{process groups, get rank of member or client}
\index{process groups, test for membership}\index{protection}
\begin{defs}
\defn{gaddr\_p} Address of process group.
\defn{paddr\_p} Address of process.
\end{defs}
{\tt pg\_rank\_all} is just like {\tt pg\_rank}, but includes clients in the
ranking as well.
Clients have larger rank than members.
Processes that are neither members nor clients are ranked $-1$.
{\em Warning: this routine is extremely slow if called from a non-member,
non-client of the group.}
\begin{func}{Send a UNIX signal to the members of a process group}
\begin{verbatim}
pg_signal (gaddr_p, signo)
address *gaddr_p;
int signo;
\end{verbatim}
\end{func}
\index{\tt pg\_signal}\index{process groups, send UNIX-style signal}
\begin{defs}
\defn{gaddr\_p} Address of process group.
\defn{signo} UNIX signal number to be delivered.
\end{defs}
{\tt pg\_signal} causes the {\em UNIX} signal {\tt signo} to be delivered to
the members of a process group.
This function bypasses the ISIS ordering mechanisms and is intended to be
used only in exceptional situations.

This manual describes the structure and use of the ISIS toolkit for
distributed and fault-tolerant programming.
The material is structured so that a user unfamiliar with ISIS should be
able to write and test a distributed application after reading just the
first two chapters.
Subsequent chapters cover the various tools provided by ISIS in
much greater detail, and the reader who works through them should
emerge with a thorough practical understanding of what ISIS does and how
to make use of it.
Throughout the manual, an attempt has been made to include code samples
as often as possible.
Our hope is that in many cases, the reader will be able to copy these
code samples, making only a small
number of changes to adapt them into the environments where they will actually be used.
The structure of the manual is as follows:
\begin{description}
\item [Chapter 1. Getting Started]
An introduction to ISIS, focusing on an example of a small but
typical ISIS application.
\item [Chapter 2. ISIS in the Large]
How to approach the design of a typical ISIS application.
\item [Chapter 3. The major components of the ISIS system]
A brief introduction to the structure of ISIS itself.
\item [Chapter 4. Basic Facilities]
Basic data structures and ISIS system calls.
How to maintain replicated data using ISIS.
\item [Chapter 5. More About Messages]
A more detailed discussion of the ISIS message subsystem.
\item [Chapter 6. The Lightweight Task Subsystem]
A more detailed discussion of the ISIS lightweight task subsystem.
\item [Chapter 7. Broadcast Interface]
The long form of the broadcast system call, and the options it supports.
\item [Chapter 8. Virtual Synchrony]
A more detailed presentation of the idea behind virtual synchrony and its
impact on programming in ISIS.
\item [Chapter 9. Replicated data]
A discussion of how to maintain replicated data using ISIS.
\item [Chapter 10. Distributed and Parallel Executions]
Techniques for obtaining distributed and parallel executions in ISIS.
\item [Chapter 11. Synchronization]
How to obtain synchronization and locking using ISIS.
\item[Chapter 12. Transactions]
Dealing with transactional databases and files from within ISIS.
\item [Chapter 13. Bypass communication]
Details of the new mechanisms for fast communication in ISIS.
\item [Chapter 14. The Logging Facility]
How to use the ISIS logging facility to develop software that can recover from total failures
without losing its state.
\item [Chapter 15. Broadcast Types and order]
Types of broadcasts available and how their delivery ordering guarantees
vary.
\item [Chapter 16. Advanced Facilities]
Discussions of a number of advanced topics, including
reception and generation of signals, forking from within ISIS clients,
the remote execution facility, rules for
interacting with devices from within ISIS,
using ISIS in an X-windows program,
using ISIS in a suntools program,
the recovery manager,
an interactive ISIS control program,
creating and interpreting client and
protocol process dumps, how the system behaves when overloaded, and
adding new transport protocols to ISIS.
\end{description}
\newpage
\subsection*{APPENDICES}
\begin{description}
\item [Appendix A. Setting Up ISIS]
A section aimed at the systems administrator responsible for setting up ISIS on a network.
\item [Appendix B. Quick Reference]
A summary of the ISIS system.
\item [Appendix C. Performance of the toolkit facilities]
Will contain detailed performance information about the broadcast primitives and the toolkit
as a whole in a future version of the manual.
\item [Appendix D. Demonstration programs]
ISIS comes with several demo programs.
This appendix explains how to run them.
\item [Appendix E. FORTRAN interface to ISIS]
ISIS can be called from UNIX F77. This appendix summarizes
the changes to the ISIS interface made to support such calls.
Currently, the F77 interface has only been tested under SUN OS, but
it should work under other systems such as MACH as well.
\item [Appendix F. Dealing with old code and non-ISIS task packages]
Some systems on which ISIS runs (MACH, SUN OS 4.0, APOLLO UNIX)
support their own lightweight task/process mechanisms.
You can use these from ISIS, preemptive scheduling and all.
This chapter explains how.
The chapter also covers some issues that arise when integrating
pre-existing programs into an ISIS-based application.
\item [Appendix G. Using ISIS from LISP]
ISIS can be called from Allegro Common LISP and LUCID
Common LISP (we are working on Harlequin). This chapter covers the necessary details.
\item [Appendix H. The META System]
META is a system for defining and monitoring realtime sensors under ISIS and
for triggering actions based on detected events.
\end{description}
\subsection*{Changes to ISIS in switching from V1.3.1 to V2.1}
The purpose of this section is to summarize the ways that ISIS has changed in
going from ISIS V1.3.1 to V2.1.
Most changes are upgrades that continue to support existing code
without requiring modifications.
\begin{enumerate}
\item
This manual has been extensively revised and has major new sections on
issues such as large-systems architecture and long-haul communication.
Research papers are available on most of the major changes, through Cornell.
\item
The new BYPASS facility is working quite well even for many groups and
rapid group membership changes. You need to enable this at compile time
and you need to be consistent in any given group (either all members link
with a "bypass" copy of clib or none do so). In ISIS V3.0 BYPASS will be
the default.
Right now, because {\tt pg\_client} doesn't exploit the BYPASS protocols
we don't enable it by default, but this restriction will soon be eliminated.
\item
The META system has been substantially extended and is described in Appendix H.
\item
There are new ways to connect to ISIS. The interface {\tt isis\_init\_l} offers a way
to force a restart of the system if an application starts up and ISIS is not running; it
also gives some control over whether ISIS will ``panic'' if the system is not up.
Restart is done, if necessary, by running the shell script {\tt /usr/bin/startisis}.
{\em You must define and install this shell script if you plan to make use of this feature.}
\item
There is a new interface, {\tt isis\_remote}, by which remote clients (on machines not
listed in the ISIS sites file) can connect to the system at some ``mother'' location,
obtaining all of its features transparently. Initially this will support only remote UNIX
clients, but we hope to extend the mechanism to support remote clients on other host
operating systems in the future (OS/2, DEC VMS, and perhaps even IBM's VM system).
In the present version of the system, if the mother machine for a remote client
fails, the remote client can trap the resulting exception by defining
a procedure {\tt isis\_failed()} that reconnects to ISIS on a different machine and
returns 0; it will, however, be necessary to rejoin any process groups to which the
application belonged each time this occurs.
The whole mechanism will be made more transparent in ISIS V3.0.
\item
Both {\tt isis\_init} and {\tt isis\_remote} now check for
environment variables that might define the ISIS port number to
connect to.
{\tt isis\_init} checks for the variable {\tt ISISPORT} and
uses it if found.
{\tt isis\_remote} checks for the variable {\tt ISISREMOTE} and
uses it if found.
\item
Associated with this interface is a new routine {\tt isis\_probe(freq,timeout)}
that asks ISIS to check the liveness of a client every {\tt freq}
seconds, shutting it down if there is no reply in {\tt timeout} seconds.
\item
A set of new mechanisms have been added for faster communication. These are called
the {\em bypass} protocol suite and include a group subset communication option
(called {\em process lists}) and a way to define user-supplied data transport routines.
A minimal interface to the transport layer has also been added, called
{\tt mbcast}.
\item
The address structure has been changed, and the routine {\tt addr\_cmp} no longer
looks at entry numbers at all. The old semantics of {\tt addr\_cmp}, in which
the entry numbers are compared under a wild-card rule, are still available
through an interface called {\tt paddr\_cmp}.
\item
The problem of group address pointers being deallocated when a process leaves
the group has been eliminated. Group addresses are now cached and the pointers
remain valid indefinitely. The overhead of this is low. The approach also
makes {\tt pg\_lookup} much cheaper.
\item
A new spooling and long-haul communication facility has been added to the system.
It can be used to build services that only run periodically, and to interconnect
physically remote ISIS clusters.
\item
The number of sites that can be connected to ISIS has been greatly increased,
to 255 per cluster plus an unlimited number of machines using the {\tt isis\_remote}
interface. The {\tt sites} file only lists the sites directly in the cluster.
\item
The limit on the number of members and clients in a process group has been
{\em decreased} to 32 because the BYPASS protocol is slow for group
membership changes with more than this number of members. If you don't
use BYPASS you could set PG\_ALEN in protos/pr\_groups.h
to a higher value (it used to be 128).
We plan to eliminate the limit on clients completely in V3.0 of the system,
but the limit on members is probably not going away anytime soon.
\item
A new routine {\tt cc\_terminate\_l} is supported; it sends a copy of the
termination message to an additional group or process destination as well
as terminating the coordinator-cohort computation, all in a single atomic action.
When the extra address refers to a group, the caller must belong to that group.
\item
A new routine {\tt pg\_detect\_failure} is available for detecting total failure
in a group to which the caller does not belong.
\item
The runtime memory requirements of the system have been reduced.
\item
ISIS can now be accessed from C++ and compiled or called from GCC.
The necessary type signatures are included.
\item
Some routines have become macros, such as {\tt msg\_delete} and {\tt addr\_cmp}.
This may result in problems compiling code that passed such routines as
addresses (there are usually ``real'' versions of them around too, for example
{\tt MSG\_DELETE} and {\tt ADDR\_CMP}).
\item
The task facility has been extended to support a task-level {\em select} mechanism,
called {\tt isis\_wait}, as well as a way to enter and leave the ISIS tasking
world dynamically, i.e. to exploit true parallelism on a multiprocessor.
\item
To benefit from the Mach copy-on-write message passing mechanism, ISIS now
uses Mach IPC if possible for its intra-machine communication. The change is
transparent but leads to a substantial performance improvement within Mach systems.
\item
Ways to refuse to participate in a state transfer or coordinator-cohort computation
have been added.
\item
The client dump format has been extended and improved.
\item
The SUNTOOLS graphics interface is being phased out in favor of
the various forms of X11, including Open Look.
We are looking at improvements to the X11 interface at the widget level, which
is currently not an option in ISIS.
The recommended X11 interface has changed a bit (see the spread or grid
demos for examples).
\item
A new message library has been added that includes a way to put an indirect
data reference into a message, using a format {\tt \%*X}.
ISIS does a call-back to a user-supplied routine when finished with the pointer.
By combining this with the {\tt \%-X} format type in {\tt msg\_get}, all data
copying can be eliminated.
\item
Support for the floating point and double-precision data types has been added.
ISIS assumes that the IEEE standard floating point format is in use.
\item
The remote-exec utility has been extended to implement the remote user-id and
password options, and the parallel make demo has been fixed.
\item
An all-out attack on performance has greatly improved the speed of the
whole system.
\item
The broadcast ``guard'' facility has been eliminated.
\item
A bug prevents the simultaneous use of a SUNTOOLS graphics application and the new
bypass mode software; things are fine if ISIS is compiled with BYPASS disabled
(no -DBYPASS on command line when compiling {\tt clib/cl\_bypass.c}).
However, compiling this way slows things down to the performance of the old V1.3.1 system.
\item
The fortran interface now supports function calls with underscores in
the variable names, for use from FORTRAN compilers that permit such names.
It is illegal to mix both styles of reference, however.
\item
A bug in the SUN4 lightweight context switch code was fixed,
permitting larger numbers of tasks and faster task-to-task switching.
\item
An unsupported feature permitting users to run multiple ISIS systems on
a single machine (i.e. to debug code that senses and reacts to remote
hardware failures) is now supported (see Chapter 16).
\item
A new style of message reception is supported. You specify the entry
routine as {\tt MSG\_ENQUEUE} and use {\tt msg\_rcv} to collect the
arriving messages one by one. There are performance advantages but many
virtual synchrony risks to doing this.
\item
Timeouts are supported in {\tt isis\_accept\_events()} and in {\tt bcast\_l}.
\end{enumerate}
Looking further into the future, we expect ISIS V3.0 to be available
sometime in late 1990. That version of the system
will tolerate network partitioning and support scaling mechanisms suitable
for use in extremely large networks, with potentially thousands of nodes.
Information on future releases of ISIS is available from Cornell University.
Contact isis@cs.cornell.edu to be added to the ISIS mailing list for
new technical reports or to receive printed notification of new releases.
We also urge ISIS users to follow the network newsgroup ``comp.sys.isis''
for discussion of ISIS-related topics, bug fixes, and so forth.
ISIS bugs should be reported to isis-bugs@cs.cornell.edu. We respond
promptly to any and all reports, however minor.
If you have problems installing ISIS on your system, we will be happy to help.

ISIS includes a facility for monitoring system behavior that can be useful for
identifying bottlenecks in software that is performing badly under heavy load.
Before running this tool, it is advisable to run the UNIX {\tt vmstat} command
\index{{\tt vmstat} contrasted with {\tt prstat}}
\index{{\tt prstat} command for monitoring ISIS}\index{ISIS monitoring utility ({\tt prstat})}
to confirm that ISIS is actually accumulating a significant amount of CPU time.
We mention this because ISIS requires between 350k and 500k of resident memory
to run effectively. On a UNIX system where very little memory is available,
ISIS has been observed to ``thrash to death'' due to heavy page faulting.
A common cause for such a problem is that some other program, perhaps an ISIS
client, has a memory allocation bug which is causing it to gradually consume larger
and larger amounts of virtual memory.
\index{thrashing, effect on ISIS performance}
\index{thrashing, causes of}
The symptom will be high page-fault rates for the ISIS protocols process and very
low levels of CPU utilization.
If ISIS is accumulating large amounts of CPU time, this suggests that it is working
hard. The {\tt prstat} command will print statistics to assist you in understanding
what the system is actually doing.
The arguments to {\tt prstat} are both optional. If included, the port number to use
should be specified and then the sampling interval in seconds.
{\tt prstat} assumes that if an argument is a big number, it represents a port number
and that a small number is a sampling interval.
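For example (the port number shown is purely illustrative):
\begin{verbatim}
prstat 5            sample every 5 seconds using the default port
prstat 1605 5       sample every 5 seconds using port 1605
\end{verbatim}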
The column headings used by {\tt prstat} are as follows:
\begin{description}
\item[TIME]
Wall clock time at which this line was printed.
\item[NT]
Number of tasks currently running in the protocols process.
\item[FORK]
Number of lightweight protocols process tasks created since the last line was printed.
\item[CTXT]
Number of internal context switches that occurred since the last line was printed.
\item[C+]
Number of cbcast protocols started since the last line was printed.
\item[C-]
Number of cbcast protocols finished since the last line was printed.
\item[A+]
Number of abcast protocols started since the last line was printed.
\item[A-]
Number of abcast protocols finished since the last line was printed.
\item[G+]
Number of gbcast protocols started since the last line was printed.
\item[G-]
Number of gbcast protocols finished since the last line was printed.
\item[GA]
Number of gbcast protocols that had to be aborted due to locking collisions.
\item[FAN]
Average number of messages sent per broadcast during this interval.
When a multi-round protocol runs, the FAN factor will be a {\em multiple} of the number of
destinations actually specified. FAN is thus a measure of how much work ISIS
needed to do to send the average broadcast done during this interval.
\item[CP]
Number of cbcast packets sent to remote sites this interval.
\item[CM]
Number of cbcast messages they contained, including piggybacked messages.
\item[CR]
Number of cbcast messages received and delivered locally, excluding piggybacked messages to
non-local destinations.
\item[DP]
Number of duplicate cbcast messages received and discarded.
\item[AS]
Number of times the system garbage collection protocol ran during this interval.
\item[SYS]
Number of system calls done by ISIS clients at this site during this interval.
\item[REP]
Number of messages sent by ISIS to clients at this site during this interval.
\item[FA]
Number of times a cached process group view had to be retrieved from a remote site
during this interval.
\item[LK]
Number of {\tt pg\_lookup} operations performed this interval.
\item[VC]
Number of changes to locally resident process group views that occurred during this interval.
\item[IS]
Number of packets sent from this site to other sites during this interval.
\item[IG]
Number of packets received from other sites during this interval.
\item[IM]
Number of messages received and accepted from other sites during this interval
(excludes intersite acknowledgements, duplicates).
\item[NL]
Number of locally generated entries in cbcast/abcast message store that can be
deleted next time garbage collection runs.
\item[ND]
Number of entries in the cbcast message store that can be
deleted next time garbage collection runs, including cbcast entries made by
remote sites, but excluding abcast entries.
\item[SZ]
Memory resident size of the protocols process, in K-bytes.
\item[PF]
Should be page-faults this interval, but prints as 0 for some reason.
\item[UCPU]
User CPU time expended running ISIS protocols, in milliseconds.
\item[SCPU]
System CPU time expended doing I/O on behalf of ISIS protocols, in milliseconds.
\item[\%]
(UCPU+SCPU) as a percentage of the total interval.
\end{description}
\newenvironment{qcode}{}{}
\newcommand{\lmark}{\makebox[0in][r]{\makebox[0.5in][l]{$\Rightarrow$}}}
\setlength{\parindent}{0em}
\setlength{\parskip}{1ex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Miscellaneous}
\markboth{\hrf \rm Miscellaneous\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Miscellaneous\hrf}
\subsection*{Constants}
\index{\tt IE\_XXXX}
\begin{qcode}\begin{verbatim}
IE_XXXX -- Isis error numbers
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt isis\_perror}
\begin{qcode}\begin{verbatim}
isis_perror(string)
char *string;
\end{verbatim}\end{qcode}
\subsection*{Global Variables}
\index{\tt isis\_errno}
\begin{qcode}\begin{verbatim}
int isis_errno;
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Sites and addresses}
\markboth{\hrf \rm Sites and Addresses\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Sites and Addresses\hrf}
\subsection*{Types}
\index{\tt site\_id}
\index{addresses, structure defined}
\begin{qcode}\begin{verbatim}
typedef short site_id;
typedef struct {
        union {
                short u_process;
                short u_groupid;
        } ad_un;
        u_char  addr_site;
        u_char  addr_incarn;
        u_char  addr_entry;
        ... padding ...
} address;
#define addr_process ad_un.u_process
#define addr_groupid ad_un.u_groupid
\end{verbatim}\end{qcode}
\subsection*{Constants}
\index{\tt ISAPID}
\index{\tt ISAGID}
\begin{qcode}\begin{verbatim}
ISAGID -- (addr.portno == ISAGID) for group addresses
\end{verbatim}\end{qcode}
\subsection*{Macros}
\index{\tt SITE\_NO}
\index{\tt SITE\_INCARN}
\index{\tt MAKE\_SITE\_ID}
\index{\tt SITE\_IS\_UP}
\begin{qcode}\begin{verbatim}
SITE_NO (sid)
SITE_INCARN (sid)
MAKE_SITE_ID (sno, sincarn)
site_id sid;
char sno, sincarn;
SITE_IS_UP (sno, sincarn)
char sno, sincarn;
\end{verbatim}\end{qcode}
The term {\em cluster} refers to a set of ISIS sites,
and is limited to 254 machines, but with an unlimited number of
remote clients.
Inter-cluster communication is via the long-haul facility.
\subsection*{Subroutines}
\index{\tt paddr\_isequal}
\index{\tt addr\_isequal}
\index{\tt addr\_isnull}
\index{\tt addr\_ismine}
\index{\tt addr\_cmp}
\begin{qcode}\begin{verbatim}
int paddr_isequal (a1_p, a2_p) -- checks entry field too
int addr_isequal (a1_p, a2_p) -- ignores entry field
int addr_isnull (a_p)
int addr_ismine (a_p)
int addr_cmp (a1_p, a2_p)
address *a_p, *a1_p, *a2_p;
site_id sid;
address a_l[];
\end{verbatim}\end{qcode}
\subsection*{Global Variables}
\index{\tt NULLADDRESS}
\index{\tt my\_address}
\index{\tt my\_site\_no}
\index{\tt my\_site\_incarn}
\index{\tt my\_site\_id}
\index{\tt my\_process\_id}
\index{\tt my\_host}
\begin{qcode}\begin{verbatim}
address NULLADDRESS; -- Null address
address my_address; -- My address
int my_site_no; -- My site number
int my_site_incarn; -- My site incarnation
site_id my_site_id; -- My site id
int my_process_id; -- My process id
char my_host[64]; -- My host name
char site_names[64][...]; -- Host names, indexed by site number
saddr get_addr_by_name(name,portno) -- INET address by host name, portno
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Messages}
\markboth{\hrf \rm Messages\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Messages\hrf}
\subsection*{Types}
\index{messages, {\tt message} data type}
\begin{qcode}\begin{verbatim}
typedef struct { ... } message;
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt msg\_newmsg}
\index{\tt msg\_delete}
\index{\tt msg\_increfcount}
\index{\tt msg\_copy}
\index{\tt msg\_put}
\index{\tt msg\_putfld}
\index{\tt msg\_getfld}
\index{\tt msg\_gettype}
\index{\tt msg\_printaccess}
\index{\tt msg\_rewind}
\index{\tt msg\_gen}
\index{\tt msg\_getsender}
\index{\tt msg\_getreplyto}
\index{\tt msg\_gettruesender}
\index{\tt msg\_isforwarded}
\index{\tt msg\_read}
\index{\tt msg\_write}
\index{\tt msg\_fread}
\index{\tt msg\_fwrite}
\begin{qcode}\begin{verbatim}
message *msg_newmsg ()
int msg_delete (msg_p)
int msg_increfcount(msg_p)
message *msg_copy (msg_p)
int msg_put (msg_p, format, arg1, arg2, ...)
int msg_get (msg_p, format, arg1, arg2, ...)
int msg_putfld (msg_p, field, format, arg1, arg2, ...)
int msg_getfld (msg_p, field, pos_p, format, arg1, arg2, ...)
int msg_gettype (msg_p, field, pos_p)
int msg_printaccess(msg_p)
int msg_rewind (msg_p)
message *msg_gen (format, arg1, arg2, ...)
address *msg_getsender (msg_p)
address *msg_getreplyto (msg_p)
address *msg_gettruesender(msg_p)
int msg_isforwarded (msg_p)
message *msg_read (fd)
int msg_write (fd, msg_p)
message *msg_fread (file)
int msg_fwrite (file, msg_p)
message *msg_p;
int field;
int *pos_p;
char *format;
int fd;
FILE *file;
\end{verbatim}\end{qcode}
\subsection*{Format Strings}
Predefined format items:
\index{format strings}
\begin{qcode}\begin{verbatim}
%a address
%b bit vector
%c char
%d long int (4-byte)
%e event-id. Relevant only when using bcast `f' option.
%f single precision (float).
%g double precision (float).
%h short int (2-byte)
%l long int (4-byte)
%m message pointer
%s character array (null-terminated character string)
\end{verbatim}\end{qcode}
Capitalized forms ({\tt \%A}, {\tt \%C}...) are used for arrays of the
corresponding base type; a second argument is required in these cases
giving the length of the array in units of array-elements for {\tt msg\_put}, and
a place to put the length (may be given as a null pointer) for {\tt msg\_get}.
If the length is a constant (generally, 1), the notation {\tt \%X[len]} can be
used to specify the length, in lieu of a second argument. This works
for both {\tt msg\_put} and {\tt msg\_get} (if the length found in the
message doesn't match the given length, {\tt IE\_MISMATCH} is returned).
On a {\tt msg\_put}, the default is to copy the data specified by the
argument into the message.
The notation {\tt \%*X} overrides this, and indicates that the system should
keep a pointer to the data area supplied by the caller.
This format should only be used for large objects (>128 bytes) and
requires a third argument (or a second argument if a fixed-size
format was specified, as in {\tt \%*X[100]}).
The extra argument is a procedure that ISIS will call back,
with the data pointer as an argument, when the message is deallocated
because the last reference to it has been deleted.
On a {\tt msg\_get}, the default is to copy from the
message into the place specified by the argument.
The notation {\tt \%+X} overrides this, and indicates that the system should
{\tt malloc} space for the new object and return a pointer to the space;
the user must {\tt free} this later.
In contrast, {\tt \%-X} yields a pointer directly into the message.
Arrays of strings and messages are not supported, but dynamic allocation is permitted
for arbitrary types, including strings.
Thus {\tt \%-s} returns a pointer to the string in the message, and
{\tt \%+s} returns a pointer to a malloc'ed copy of the string.
{\small
\begin{qcode}\begin{verbatim}
-- Putting values into a message
msg_put(m, "%x", v); -- stores value v
msg_put(m, "%s", s); -- stores string
msg_put(m, "%X", v_a, n); -- stores array of n values
msg_put(m, "%A[1]", &a); -- stores a single address
msg_put(m, "%*A", A, L, f); -- stores L addresses from A, calls (*f)(A) when free
msg_put(m, "%*A[100]", A, f);-- stores from A[0..99], calls (*f)(A) when free
-- Retrieving values from a message
msg_get(m, "%x", &v); -- retrieves value into v
msg_get(m, "%s", s_a); -- copies string into s_a
msg_put(m, "%a", &a); -- retrieves a single address
msg_put(m, "%A[1]", &a); -- retrieves a single address
msg_put(m, "%+a", &a_p); -- malloc's an address and returns pointer
msg_get(m, "%X", v_a, &n); -- retrieves array into v_a,
number of elements into n
msg_get(m, "%-X", &v_p, &n); -- returns pointer to array in message
msg_get(m, "%-s", &s); -- returns pointer to string in message
msg_get(m, "%+X", &v_p, &n); -- returns pointer to malloc'ed array.
msg_get(m, "%+s", &s); -- returns pointer to malloc'ed string.
-- Collecting broadcast replies
bcast(..., "%x", v_a); -- collect values in array
bcast(..., "%X", v_aa, &n_a); -- collect arrays into v_aa,
number of elements into n_a
bcast(..., "%-X", ...); -- NOT ALLOWED
bcast(..., "%+X", v_pa, &n_a);-- collect pointers to malloc'ed arrays.
xxx v, v_a[], v_aa[][], *v_p, *v_pa[];
int n, n_a[];
char *s, s_a[];
address a, *a_p;
\end{verbatim}\end{qcode}
}
\subsection*{Subroutines for Defining Types}
\index{\tt isis\_define\_type}
\index{\tt msg\_convertchar}
\index{\tt msg\_convertshort}
\index{\tt msg\_convertlong}
\index{\tt msg\_convertaddress}
\index{\tt msg\_convertsiteid}
\index{\tt msg\_convertpgroup}
\begin{qcode}\begin{verbatim}
isis_define_type(formatletter, size, converter)
char formatletter;
int size;
int (*converter)();
msg_convertchar (char_p)
msg_convertshort (short_p);
msg_convertlong (long_p);
msg_convertaddress (address_p)
msg_convertsiteid (site_id_p);
msg_convertpgroup (groupview_p);
xxx *xxx_p;
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Broadcasts}
\markboth{\hrf \rm Broadcasts\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Broadcasts\hrf}
\index{broadcasts}
\subsection*{Constants}
\index{\tt ALL}
\index{\tt MAJORITY}
\begin{qcode}\begin{verbatim}
ALL -- (nrpl == ALL): wait for replies from all destinations
MAJORITY -- (nrpl == MAJORITY): wait until a majority has replied
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt bcast}
\index{\tt cbcast}
\index{\tt abcast}
\index{\tt gbcast}
\index{\tt bcast\_l}
\index{\tt reply}
\index{\tt reply\_l}
\index{\tt nullreply}
\index{\tt abortreply}
\index{\tt forward}
\index{\tt flush}
\index{\tt bcwait}
\index{\tt bcpoll}
\index{\tt bccancel}
\index{\tt bc\_getevent}
\index{\tt my\_eid}
\begin{qcode}\begin{verbatim}
int n = xbcast (addr_p, entry, fmt, arg1, ..., 0)
int n = xbcast (addr_p, entry, fmt, arg1, ..., nrpl, fmt, arg1, ...)
int n = xbcast_l (option_string, parameters)
int n = xbcast_l ("x", addr_p, entry, fmt, arg1, ..., nrpl, fmt, arg1, ...)
int n = xbcast_l ("l", addr_l, fmt, arg1, ..., nrpl, fmt, arg1, ...)
int n = xbcast_l ("m", addr_p, entry, msg_p, nrpl, msg_pa)
int bcid = xbcast_l ("f", addr_p, entry, ...)
int reply (msg_p, format, arg1, arg2, ...)
int reply_l ("m", msg1_p, msg2_p)
int reply_l ("c", msg_p, addr_l, fmt, arg1, arg2, ...)
int reply_l ("cx", msg_p, addr_l, fmt, arg1, arg2, ...)
int nullreply (msg_p)
int abortreply (msg_p)
int forward (fmsg_p, addr_p, entry, cmsg_p)
int flush ()
int n = bcwait (bcid)
int n = bcpoll (bcid)
int n = bccancel (bcid)
event_id *eid_p = bc_getevent (bcid)
char *option_string; -- Options string
address *addr_p; -- Destination address
int entry; -- Entry point for dest. address
address addr_l[]; -- Null terminated address list
char *fmt; -- Format string
xxx arg1, arg2, ... -- Arguments as described by fmt
int nrpl; -- Number of replies wanted
message *msg_p, *msg1_p, *msg2_p; -- Message pointers
message *fmsg_p; -- Message forwarder received
message *cmsg_p; -- Message to actually send
message *msg_pa[]; -- Array of message pointers
int n; -- Number of replies received
int bcid; -- Broadcast id
event_id *eid_p; -- Event id
\end{verbatim}\end{qcode}
\subsection*{Bypass protocols}
Bypass broadcast is used for messages to a single group to which the
sender belongs, or for replies to a group broadcast. There is no
bypass {\tt gbcast} protocol.
The user may supply a transport protocol using the {\tt isis\_transport}
interface to optimize for faster message delivery.
\subsection*{Global Variables}
\index{\tt isis\_nsent}
\index{\tt isis\_nreplies}
\index{\tt my\_eventid}
\begin{qcode}\begin{verbatim}
int isis_nsent; -- number of processes to which msg was sent
int isis_nreplies; -- number of replies received
event_id my_eid; -- Current broadcast event id
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Process groups}
\markboth{\hrf \rm Process Groups\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Process Groups\hrf}
\index{process groups}
\subsection*{Constants}
\index{\tt PG\_LOGGED}
\index{\tt PG\_DONTCREATE}
\index{\tt PG\_INIT}
\index{\tt PG\_XFER}
\index{\tt PG\_JOIN\_AUTHEN}
\index{\tt PG\_CLIENT\_AUTHEN}
\index{\tt PG\_CREDENTIALS}
\index{\tt PG\_MONITOR}
\begin{qcode}\begin{verbatim}
-- Keywords for options in pg_join
PG_LOGGED -- Recover state automatically after crashes
PG_DONTCREATE -- Flag that prevents group creation
PG_INIT -- PG_INIT, routine
PG_XFER -- PG_XFER, domain, send_routine, receive_routine
PG_JOIN_AUTHEN -- PG_JOIN_AUTHEN, auth_routine
PG_CLIENT_AUTHEN -- PG_CLIENT_AUTHEN, auth_routine
PG_CREDENTIALS -- PG_CREDENTIALS, cred_string
PG_MONITOR -- PG_MONITOR, mon_routine, arg
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt pl\_create}
\index{\tt pl\_add}
\index{\tt pl\_remove}
\index{\tt pl\_delete}
\index{\tt pl\_makegroup}
\index{\tt pl\_rank}
\index{\tt pl\_lookup}
\index{\tt pg\_lookup}
\index{\tt pg\_join}
\index{\tt pg\_subgroup}
\index{\tt pg\_leave}
\index{\tt pg\_delete}
\index{\tt pg\_client}
\index{\tt pg\_signal}
\index{\tt pg\_rank}
\index{\tt pg\_inhibit\_joins}
\index{\tt isis\_inhibit\_joins}
\index{\tt allow\_xfers\_xd}
\begin{qcode}\begin{verbatim}
address *pg_lookup (search_path)
address *pg_join (gname, key, args, ..., key, args, ..., 0)
address *pg_subgroup (pgaddr_p, sgname, incarn, members, clients)
int pg_leave (gaddr_p)
int pg_delete (gaddr_p)
int pg_client (gaddr_p, cred_string)
int pg_signal (gaddr_p, signo)
int pg_rank (gaddr_p, paddr_p)
int pg_rank_all (gaddr_p, paddr_p)
int pg_inhibit_joins (flag)
address *pl_create (gaddr_p, alist)
void pl_add (list_p, paddr_p)
void pl_remove (list_p, paddr_p)
void pl_delete (list_p)
int pl_rank (list_p, paddr_p)
int pl_lookup (list_p)
int isis_inhibit_joins (entry_number)
int allow_xfers_xd (gname, XD_USER+xd, snd_routine, rcv_routine)
char *search_path; -- Search scope and group name
(``string'' or ``@scope:string'')
char *gname; -- Group name (null-terminated string)
char *sgname; -- Subgroup name (null-terminated string)
address *pgaddr_p; -- Parent group address
address *gaddr_p; -- Group address
address *paddr_p; -- Process address
address members[]; -- Null terminated list of initial members
address clients[]; -- Null terminated list of initial clients
int key; -- Key word for optional parameters
int signo; -- Unix signal number
int flag; -- 1 (true) or 0 (false)
int xd; -- Transfer domain number (XD_USER + domain_offset)
int (*snd)(); -- State xfer out
int (*rcv)(); -- State xfer in
int entry_number; -- Entry number during which to inhibit joins
int (*auth_routine)(addr_p, cred_string);
char *cred_string; -- Credentials of joining process
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Monitoring and watching}
\markboth{\hrf \rm Monitoring and Watching\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Monitoring and Watching\hrf}
\index{\em monitoring}
\index{watch}
\subsection*{Types}
\index{\tt bitvec}
\index{\tt sview}
\index{{\tt groupview} data structure}
\begin{qcode}\begin{verbatim}
typedef struct { ... } bitvec;
typedef struct {
    int     sv_viewid;
    site_id sv_slist[MAX_SITES + 1];
    u_char  sv_incarn[MAX_SITES + 1];
    bitvec  sv_failed;
    bitvec  sv_recovered;
} sview;
typedef struct {
    int     gv_viewid;              -- View number
    int     gv_incarn;              -- Incarnation number
    int     gv_flag;                -- Indicates if this is cached or not
    address gv_gaddr_p;             -- Group address
    char    gv_name[PG_GLEN];       -- Group name
    short   gv_nmemb;               -- Number of members
    short   gv_nclient;             -- Number of clients
    address gv_members[PG_ALEN];    -- Group members, terminated by NULLADDRESS
    address gv_clients[PG_ALEN];    -- Group clients
    address gv_joined;              -- New member, if any
    address gv_departed;            -- Departed member, if any
    int     gv_refcount;            -- Reference count
} groupview;
\end{verbatim}\end{qcode}
\subsection*{Constants}
\index{\tt MAX\_SITES}
\index{\tt MAXBITS}
\index{\tt PG\_GLEN}
\index{\tt PG\_ALEN}
\index{\tt W\_FAIL}
\index{\tt W\_RECOVER}
\index{\tt W\_JOIN}
\index{\tt W\_LEAVE}
\begin{qcode}\begin{verbatim}
MAX_SITES -- Maximum number of sites in the system
MAXBITS -- Maximum number of bits in a bit vector
PG_GLEN -- Maximum chars in a pgroup name
PG_ALEN -- Maximum number of members & clients per group
W_FAIL -- watch for a site to fail
W_RECOVER -- watch for a site to recover
W_JOIN -- watch for a process to join a group
W_LEAVE -- watch for a process to leave a group
\end{verbatim}\end{qcode}
\subsection*{Macros}
\index{\tt bis}
\index{\tt bic}
\index{\tt bit}
\index{\tt bisv}
\index{\tt bicv}
\index{\tt bclr}
\index{\tt bset}
\begin{qcode}\begin{verbatim}
bis (vec, b) -- Set bit number "b" in bitvec "vec"
bic (vec, b) -- Clear bit number "b" in bitvec "vec"
bit (vec, b) -- Return the value of bit "b" in "vec"
bisv (vec1, vec2) -- vec1 = vec1 | vec2
bicv (vec1, vec2) -- vec1 = vec1 & ~vec2
bclr (vec) -- Clear all bits of "vec"
bset (vec) -- Set all bits of "vec"
bitvec *vec, *vec1, *vec2;
int b;
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt sv\_getview}
\index{\tt pg\_getview}
\index{\tt sv\_monitor}
\index{\tt pg\_monitor}
\index{\tt sv\_watch}
\index{\tt pg\_watch}
\index{\tt proc\_watch}
\index{\tt sv\_monitor\_cancel}
\index{\tt pg\_monitor\_cancel}
\index{\tt sv\_watch\_cancel}
\index{\tt pg\_watch\_cancel}
\index{\tt proc\_watch\_cancel}
\begin{qcode}\begin{verbatim}
sview *sv_getview ()
groupview *pg_getview (gaddr_p)
int mid = sv_monitor (sm_routine, arg)
int mid = pg_monitor (gaddr_p, pm_routine, arg)
int wid = sv_watch (sid, what, sw_routine, arg)
int wid = pg_watch (gaddr_p, paddr_p, what, pw_routine, arg)
int wid = proc_watch (paddr_p, pw_routine, arg)
int wid = pg_detect_failure (gaddr_p, pw_routine, arg)
int sv_monitor_cancel (mid)
int pg_monitor_cancel (mid)
int sv_watch_cancel (wid)
int pg_watch_cancel (wid)
int proc_watch_cancel (wid)
int (*sm_routine)(sview_p, arg);
int (*pm_routine)(gview_p, arg);
int (*sw_routine)(sid, what, arg);
int (*pw_routine)(paddr_p, what, arg);
sview *sview_p;
groupview *gview_p;
int arg;
site_id sid;
address *gaddr_p, *paddr_p;
int what; -- What to watch for: W_FAIL, W_RECOVER, ...
int mid, wid;
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Coordinator-cohort}
\markboth{\hrf \rm Coordinator-cohort, State Transfer\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Coordinator-cohort, State Transfer\hrf}
\subsection*{Constants}
\index{\tt ORIGINAL}
\index{\tt TAKEOVER}
\begin{qcode}\begin{verbatim}
ORIGINAL -- Coordinator is the original one
TAKEOVER -- Coordinator is taking over due to a failure
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{coordinator-cohort computation, \tt coord\_cohort}
\index{cc\_terminate, cc\_terminate\_l}
\begin{qcode}\begin{verbatim}
int coord_cohort (msg, gaddr_p, action_routine, got_reply, arg)
message *msg;
address *gaddr_p;
int (*action_routine)(msg, gaddr_p, how, arg);
int (*got_reply)(msg);
char *arg;
int cc_terminate(fmt, args, ...)
char *fmt;
int cc_terminate_l(addr_p, entry, fmt, args, ...) -- send CC to (addr,entry)
char *fmt;
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{State transfer}
\index{state transfer}
\subsection*{Constants}
\index{\tt PG\_BIGXFER}
\begin{qcode}\begin{verbatim}
PG_BIGXFER -- Big state transfer
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt xfer\_out}
\index{\tt xfer\_flush}
\begin{qcode}\begin{verbatim}
int xfer_out (position, format, arg1, arg2, ...)
int xfer_refuse ()
int xfer_flush ()
long position;
char *format;
xxx arg1, arg2, ...
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Tasking}
\markboth{\hrf \rm Tasking\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Tasking\hrf}
\index{tasks}
\subsection*{Types}
\index{\tt condition}
\begin{qcode}\begin{verbatim}
typedef struct { ... } *condition;
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt isis\_init}
\index{\tt isis\_init\_l}
\index{\tt isis\_remote}
\index{\tt isis\_probe}
\index{\tt isis\_task}
\index{\tt isis\_entry}
\index{\tt isis\_input}
\index{\tt isis\_wait}
\index{\tt isis\_select}
\index{\tt isis\_input\_sig}
\index{\tt isis\_output}
\index{\tt isis\_output\_sig}
\index{\tt isis\_except}
\index{\tt isis\_except\_sig}
\index{\tt isis\_chwait}
\index{\tt isis\_chwait\_sig}
\index{\tt isis\_signal}
\index{\tt isis\_signal\_sig}
\index{\tt isis\_logged}
\index{\tt isis\_mainloop}
\index{\tt isis\_start\_done}
\index{\tt isis\_failed}
\index{\tt tk\_connect}
\index{\tt t\_fork}
\index{\tt t\_fork\_msg}
\index{\tt t\_fork\_urgent}
\index{\tt t\_wait}
\index{\tt t\_wait\_l}
\index{\tt t\_sig}
\index{\tt t\_sig\_urgent}
\index{\tt t\_sig\_all}
\index{\tt t\_scheck}
\index{\tt t\_set\_stacksize}
\index{\tt isis\_entry\_stacksize}
\index{\tt t\_on\_sys\_stack}
\begin{qcode}\begin{verbatim}
int isis_init (port_no)
int isis_init_l (port_no, options)
int isis_remote (mother_machine, port_no)
int isis_probe (freq, timeout)
int isis_task (task, name)
int isis_entry (entry, handler, name)
int isis_wait (nfd, imask, omask, xmask, timer)
int isis_select (nfd, imask, omask, xmask, timer) -- synonym for isis_wait
int isis_input (fd, i_routine, arg)
int isis_input_sig (fd, cond_p, arg)
int isis_output (fd, o_routine, arg)
int isis_output_sig (fd, cond_p, arg)
int isis_except (fd, e_routine, arg)
int isis_except_sig (fd, cond_p, arg)
int isis_signal (signo, i_routine, arg)
int isis_signal_sig (signo, cond_p, arg)
int isis_chwait (i_routine, arg)
int isis_chwait_sig (cond_p, arg)
int isis_logged (gaddr_p, entry)
int isis_mainloop (task)
int isis_start_done ()
int isis_failed ()
int isis_timeout (time, i_routine, arg, arg1)
int fdes = tk_connect (address) -- make UNIX stream connection
int t_fork (task, arg)
int t_fork_msg (task, msg_p)
int t_fork_urgent (task, arg)
int arg = t_wait (cond_p)
int arg = t_wait_l (cond_p, "descriptive message")
int t_sig (cond_p, arg)
int t_sig_urgent (cond_p, arg)
int t_sig_all (cond_p, arg)
int t_scheck ()
int t_set_stacksize (stacksize)
int isis_entry_stacksize (entry, stacksize)
int res = t_on_sys_stack (i_routine, arg)
macro THREAD_LEAVE_ISIS() -- enter pre-emptive world
macro THREAD_ENTER_ISIS() -- acquire ISIS mutex
int port_no;
int (*task)(arg);
int (*handler)(msg_p);
int (*i_routine)(fd);
int (*iv_routine)(fd_vec);
int nfd, *imask, *omask, *xmask;
struct timeval timer;
int time;
int arg, arg1;
int res;
message *msg_p;
int fd, signo;
condition *cond_p;
int stacksize; -- Rounded up to multiple of 1k
\end{verbatim}\end{qcode}
\subsection*{Global Variables}
\index{\tt isis\_socket}
\begin{qcode}\begin{verbatim}
int isis_socket;
\end{verbatim}\end{qcode}
\subsection*{Options to isis\_init\_l (new in V2.0)}
\index{\tt ISIS\_AUTOSTART}
\index{\tt ISIS\_PANIC}
\index{{\tt /usr/bin/startisis} shell script (for auto-restart)}
\begin{qcode}\begin{verbatim}
ISIS_AUTOSTART -- Automatically start ISIS if necessary
ISIS_PANIC -- Panic if ISIS isn't up (else return -1)
/usr/bin/startisis -- Shell script for auto-restart
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Transactions}
\markboth{\hrf \rm Transactions\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Transactions\hrf}
\index{transactions}
\subsection*{Types}
\index{\tt x\_id}
\index{\tt x\_item}
\index{\tt x\_list}
\begin{qcode}\begin{verbatim}
typedef struct { ... } x_id;
typedef struct {
    x_id    id;
    int     outcome;
    message *info;
} x_item;
typedef struct {
    int     len;
    x_item  items[];
} x_list;
\end{verbatim}\end{qcode}
\subsection*{Subroutines}
\index{\tt x\_begin}
\index{\tt x\_commit}
\index{\tt x\_abort}
\index{\tt x\_term}
\index{\tt x\_outcomes}
\index{\tt x\_outcomes\_flush}
\index{\tt xid\_cmp}
\index{\tt x\_getid}
\begin{qcode}\begin{verbatim}
int x_begin()
int x_commit(phases)
int x_abort()
int x_term(participant_name, on_prepare, on_commit, on_abort,
format, args, ...)
x_list *outcomes = x_outcomes(participant_name)
x_outcomes_flush(participant_name, outcomes)
x_id id = x_getid()
int xid_cmp(id1, id2)
int phases; -- Either 1 or 2
x_id id, id1, id2;
char *participant_name;
char *fmt;
int (*on_prepare)(id);
int (*on_commit)(id);
int (*on_abort)(id);
x_list *outcomes;
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Files}
\index{files}
\markboth{\hrf \rm Files\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Files\hrf}
\subsection*{Libraries and Include Files}
\index{files, libisis1.a}
\index{libisis1.a}
\index{files, libisis2.a}
\index{libisis2.a}
\index{files, libisism.a}
\index{libisism.a}
\index{files, isis.h}
\index{\tt isis.h}
\begin{qcode}\begin{verbatim}
libisis1.a -- Toolkit
libisis2.a -- Run time support for ISIS applications
libisism.a -- Message library
isis.h -- Include file for ISIS applications
\end{verbatim}\end{qcode}
\subsection*{Commands}
\index{\tt isis}
\index{{\tt cmd} tool}
\index{\tt rmgr\_cmd}
\begin{qcode}\begin{verbatim}
isis -- Startup isis
cmd -- Command utility
rmgr_cmd -- Update recovery manager restart database
\end{verbatim}\end{qcode}
\subsection*{Executables}
\index{\tt protos}
\index{{\tt rexec} utility program}
\index{news facility}
\index{\tt lmgr}
\index{\tt rmgr}
\index{\tt xmgr}
\begin{qcode}\begin{verbatim}
protos -- protocols process
rexec -- remote execution service
news -- news service
lmgr -- log manager
rmgr -- recovery manager
xmgr -- transaction manager
\end{verbatim}\end{qcode}
\subsection*{System Files}
\index{files, sites}
\index{files, rmgr.rc}
\index{files, rexec.log}
\begin{qcode}\begin{verbatim}
sites -- list of all sites running isis
rmgr.rc -- list of processes started by isis
x.log -- isis diagnostic log
rexec.log -- rexec diagnostic log
xmgr.log -- transaction manager diagnostic log
\end{verbatim}\end{qcode}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Implementation status}
\markboth{\hrf \rm Implementation status\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Implementation status\hrf}
The current version of ISIS is
V2.0, and was released for beta testing in March 1990.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Protection status}
\markboth{\hrf \rm Protection status\hrf \bf Quick Reference}{\bf Quick Reference\hrf \rm Protection status\hrf}
All parts of the ISIS system itself are in the public domain.
Although proprietary products and software may be based on ISIS, the ISIS
system itself, or any simple modification of the system, or any port of the
system to other vendor hardware, must remain in the public domain
under the conditions of the support that was used to develop the system.
This manual is copyrighted by the ISIS Project.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ISIS provides no special tool for managing replicated data\index{replicated
data}, because
replicated data ``falls out'' directly when using the broadcast primitives.
Each replicated data item (a variable, data structure, or even a file)
should be associated with a process group; all members will
hold the most current copy.
If it is necessary to have data items with varying degrees of replication,
you should create one process group for each replication domain.
Since process groups are cheap,
domains can be created and deleted dynamically.
The associated overhead will be small.
This chapter assumes a fairly low degree of replication; the solutions it
describes would perform poorly for large process groups.
Such groups should be subdivided into multiple subgroups of perhaps 10
processes each, and updates should be done using a 2-tier scheme.
This will be automated in future releases of ISIS, which will
also contain optimized large-group message transport protocols.
\section{Normal case}
If synchronization is not an issue, replicated data can be read by reading any
local copy.
To update it, the best approach is to broadcast to all process group members
and wait for the local update to occur before continuing.
An update entry would be supported for each type of update the
application does:
\begin{verbatim}
/* Send a message to update `x' */
bcast(gaddr_p, UPDATE_X, "x = %d", new_value, 1, "");
...
update_x(mp)
    message *mp;
{
    msg_get(mp, "x = %d", &x);
    if(addr_ismine(msg_sender(mp)))
        reply(mp, "");
}
\end{verbatim}
Notice how an empty reply is used to signal completion of the local update
back to the initiator ({\tt abortreply} would have worked here too).
We did not use {\tt nullreply} because {\em all}
process group members would have to call {\tt nullreply} rather than
just the sender, a costly proposition.
If the data is in a file, the recipients could just update the file.
This is discussed further in the chapter on distributed execution.
Virtual synchrony ensures that all members of the group see the updates in the
same order.
It is possible to build some very fancy interfaces on top of this basic
mechanism.
For example, the Linda S/Net system (reported in ACM TOCS Dec. 1985 and
in Scientific American Sept. 1987)
provides a ``shared tuple space'' through which processes interact.
Tuple space operations affect the whole tuple space atomically.
Linda supports four basic operations: {\em out}, which outputs
a new tuple (it need not be unique), {\em in}, which
reads and deletes a tuple matching a user-specified ``prototype'', {\em read},
which works like {\em in} but doesn't delete the tuple, and {\em update},
which updates a matching tuple in place.
The {\em read} and {\em in} operations block if no matching tuple is found.
Linda uses a simple but powerful pattern matching mechanism based on
values and wildcards that the user specifies in the fields of a pattern.
A Linda-like interface could be built within ISIS
by substituting a fancier {\tt update} routine for the one shown above.
The basic requirement is that the algorithm used to do pattern
matching be sensitive only to the order in which {\em in, out} and {\em update}
operations are received over the network.
We recommend that the user who is learning to work with ISIS for the
first time try this as a simple programming exercise.
The detailed pattern matching rules can be found in the TOCS article on Linda.
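As a starting point for that exercise, here is a rough sketch of the {\em out}
operation only; the two-field tuple structure, the {\tt tspace} list, and the
entry number {\tt T\_OUT} are purely illustrative, and the matching logic
needed for {\em in} and {\em read} is left to the reader:
\begin{verbatim}
struct tuple { int t_key; int t_value; struct tuple *t_next; };
struct tuple *tspace;           /* Local copy of the tuple space */

/* Any process performs an `out' by broadcasting to the group */
bcast(gaddr_p, T_OUT, "%d %d", key, value, 1, "");
...
t_out(mp)                       /* Entry handler registered for T_OUT */
    message *mp;
{
    struct tuple *tp;

    tp = (struct tuple *)malloc(sizeof(struct tuple));
    msg_get(mp, "%d %d", &tp->t_key, &tp->t_value);
    tp->t_next = tspace;
    tspace = tp;
    if(addr_ismine(msg_sender(mp)))
        reply(mp, "");
}
\end{verbatim}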
The advantage of the Linda interface is that it provides a simple way
for processes that run in parallel to share a set of tasks.
Specifically, the list of tasks is placed in the tuple space by a
central coordinator who then reads out the answer computed by the
process group.
The members of the process group compete for tasks to perform by
using {\em in} to remove a task from the bag of uncomputed tasks,
computing the result, and then using {\em out} to put a tuple
containing the result back in the bag.
A process that wishes to relinquish the machine on which it is running
simply drops out of the process group any time when it is not actually
computing -- or it puts the tuple on which it was working back in the bag
and then drops out.
\section{Replicated data with unordered broadcasts}
The solution above works if all communication with the group is done using
{\tt bcast}.
Since updates are done using {\tt bcast} and all {\tt bcast} messages are
totally ordered relative to one another, the replicated variable always
returns the correct value.
In Chapter 15 we will describe other broadcast primitives that do not have
this strong ordering property\index{replicated data, configuration data}.
These primitives are used in many settings within ISIS, and they
can lead to better performance, although
the ISIS programmer will never obtain an incorrect solution to a problem by not using them.
However, the existence of these other primitives does pose a problem.
Consider an application in which
a query is made on the group using the {\tt cbcast} broadcast,
which is not guaranteed to be ordered relative to {\tt bcast}.
If {\tt bcast} is used to update the variable
as above, it is possible that some members will receive the query
before the update and respond to the query based
on the old value, while others see the update first and respond
based on the updated value.
To avoid this, we need to update the variable in a way that is ordered
relative to {\em all} other broadcasts and events.
ISIS provides another broadcast primitive called {\tt gbcast} that provides
precisely this kind of behavior.
The example below shows how it is invoked.
In general, {\tt gbcast} should be used to update replicated data that
determines the subsequent behavior of the group as a whole, when the
group is being accessed using broadcasts that are not totally ordered.
We call this kind of data {\em configuration data.}
Say that we have a data structure called
{\tt conf} and that we want to replicate it in such a way that all members will
see the same contents when they receive a given request---just as all
would see the same process group view when they receive a request.
Here is how to update it:
\begin{verbatim}
/*
 * The value of conf is used when dividing work among group
 * members
 */
int conf;

/* Send a message to update `conf' */
gbcast(gaddr_p, UPDATE_CONF, "conf = %d", conf, 1, "");
...
update_conf(mp)
    message *mp;
{
    msg_get(mp, "conf = %d", &conf);
    if(addr_ismine(msg_sender(mp)))
        reply(mp, "");
}

other_req(rmp)
    message *rmp;
{
    ... safe to decide how to process rmp using conf ...
}
\end{verbatim}
Here, when {\tt other\_req} is invoked, the value of {\tt conf}
can be used as part of the algorithm for deciding what action to take.
All
group members will ``see'' the same values in {\tt conf} provided that
they look at it before
there is a chance for some subsequent update to the configuration
to arrive.
Recalling the earlier discussion of virtual synchrony,
we could say that updates to
{\tt conf} are virtually synchronous
with respect to all other events that affect
the group.
\section{Synchronization needed}
When synchronization is needed to prevent concurrent users from
updating the same variables in parallel, other ISIS facilities come into play.
We discuss techniques for implementing synchronization in the next chapter.
\section{Updates from an exclusive sender}
At\index{replicated data, sender holds mutual exclusion}
the other extreme, consider a case where
absolutely no synchronization is needed.
Say that there is only one process that will update
or access a particular variable during some computation.
In this case a much cheaper
technique can be used to maintain it.
Unlike the above code, where
an update was done by broadcasting to
all the members including oneself,
updates on a private but replicated resource should be done locally,
and the remote copies updated
asynchronously using a version of the broadcast primitive called {\tt cbcast}.
In this case, the broadcast is invoked somewhat differently:
\begin{verbatim}
/* Update var locally */
var = new_value;

/* Asynchronously update `var' at the other group members */
cbcast_l("x", gaddr_p, UPDATE_VAR,
         "var = %d", new_value, 0);
\end{verbatim}
Broadcast
option {\tt ``x''} indicates that the update message should be sent to all
group members other than the process initiating the broadcast.
We use it here because the sender will already have done the update.
Local execution is much faster than {\tt cbcast} message delivery,
which raises the question of when it becomes safe to assume that these
asynchronous updates have reached the remote destinations.
If this is a concern, for example in a piece of code
that is being backed up by remote processes and is about
to take an action that might fail,
it is a good idea to flush these updates to ensure they have been delivered to
all their destinations:
\begin{verbatim}
/* Ensure that remote updates have all completed */
flush();
\end{verbatim}
\section{State transfer and recovery from logs}
When the\index{replicated data, state transfer issues}\index{replicated data, logging}
state of a group includes replicated data, a state transfer
routine should be written to transfer this data to new members that
join the group while it is operational.
If the {\tt PG\_LOGGED} option is specified when joining the
process group, replicated data will automatically be recorded on log files
and restored on recovery, either from the log if no group members are
operational, or by transferring (copying) the data from an operational
member if the group is up.
When using this feature, entry points that update variables would have
to be declared using {\tt isis\_logentry}, described in Chapter 14.
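As a rough sketch (the group name, the replicated variable {\tt x}, the use of
transfer domain 0, and the exact calling conventions assumed for the two
transfer routines are illustrative assumptions, not a definitive recipe), the
state transfer code for a single replicated integer might look like this:
\begin{verbatim}
int x;                          /* The replicated variable */

send_x(locator)                 /* Called at an existing member */
    long locator;
{
    xfer_out(locator, "x = %d", x);
}

recv_x(mp)                      /* Called at the joining member */
    message *mp;
{
    msg_get(mp, "x = %d", &x);
}
...
gaddr_p = pg_join("datagroup", PG_XFER, 0, send_x, recv_x, 0);
\end{verbatim}
Adding the {\tt PG\_LOGGED} option to the same {\tt pg\_join} call, and
declaring the update entries with {\tt isis\_logentry}, would then enable the
logging behavior described above.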
There are two ways to arrange for a program to run at a site \index{remote execution}
under the ISIS system.
The most common is to install the program at that site
and then use the command tool, described below,
to arrange that the program will be started each time it
crashes or the site crashes and recovers.
However, some applications need to ``pick'' a site from some
list of possible sites, for example sites at which the load
is currently low, and run a program at each of those sites.
Eventually, a sophisticated {\em resource manager} tool is
planned for this purpose; it will support all sorts of
predicates on sites and other resources, including load, accessibility
of necessary files, CPU speed, memory configuration, and so forth.
In the current version of ISIS, however, the mechanism is still unimplemented.
Given a list of sites and the number of instances of some program
that one wishes to run, the {\em remote execution} tool will
make an effort to start that number of instances of the program at the
specified sites.
The administrator for your network will have to make sure that the
{\tt rexec} program itself is already running at those sites or\index{{\tt rexec} utility program}
your {\tt rexec} requests will fail.
Moreover, if the tool is not running under the superuser account,
your program will run under the same account as {\tt rexec} was running
under, whether or not this is what you want.
However, under no circumstances
will your program run under the superuser account.
The tool is fairly unsophisticated: it runs the program with
the arguments specified by the user (and some additional arguments
passed through the UNIX ``environment'' mechanism), and then
monitors the process to make sure it succeeds in joining the
process group of the caller that requested the execution.
If the process fails before joining, the tool assumes that some
necessary file or other environmental feature needed by the program
was unavailable, and tries again at the next site in the list.
It eventually returns to the caller, giving the number of copies
that it started successfully.
Each program starts up with standard input and output set to the ``console''
on the machine on which it is run.
The interface is as follows:\index{\tt isis\_rexec}
\begin{verbatim}
nran = isis_rexec(nwanted, gid, sites, prog, args, env, user, passwd, alist)
site_id *sites;
char *prog, **args, **env, *user, *passwd;
address *gid, *alist;
\end{verbatim}
The arguments have the following meaning.
\begin {description}
\item [{\tt nwanted}] This gives the number of copies that should be run.
\item [{\tt gid}] This is the group the program is expected to join.
If specified as {\tt NULLADDRESS}, {\tt isis\_rexec} will assume that the program
started successfully if the execve() system call can be done successfully.
If a group address is given, {\tt isis\_rexec} will use {\tt pg\_watch} to
monitor for the join action, making sure that the join is successful and
starting the process at some other site if not.
{\em Note: this option is only available if the caller belongs to the group!}
\item [{\tt sites}] This is a list of sites at which to attempt the isis\_rexec;
they will be tried in order, with only one attempt per site.
Such a list can be obtained from {\tt sv\_getview} (listing all
operational sites), or (eventually) from the resource manager.
\item [{\tt prog}] This is a path name that will be used to find the program to
run.
The handling of relative path names is currently undefined.
\item [{\tt args}] This is an argument vector, in the form used by {\tt execve()}.
\item [{\tt env}] This is an environment vector, in the form used by {\tt execve()}.
\item [{\tt user}] This string specifies a user-name.
If the user-name is known on the remote machine, and the password
matches, the program will run under this user-id.
Otherwise, the program will run under the special UNIX-id {\tt nobody}.
However, if {\tt rexec} is compiled with REXEC\_REQUIRE\_SETUID defined,
and authentication fails, then the program will not be run.
\item [{\tt passwd}] The password for this user.
If {\tt rexec} is compiled with REXEC\_ALLOW\_RSH defined, then
the password need not be specified if the {\tt .rhosts} file for this
user is set up in a way that would have allowed an {\tt rsh} from the
caller's machine to the selected destination machine (see the UNIX
discussion of {\tt rsh } and {\tt .rhosts}.
\item [{\tt alist}] After the call, this argument will contain
a null-terminated list of process addresses corresponding to the
processes that were created.
\end {description}
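For illustration only (the program path, account name, and destination group
are hypothetical, and {\tt gaddr\_p} is assumed to name a group that the
caller has already joined), a request to start two copies of a worker program
at any of the operational sites might look like this:
\begin{verbatim}
address alist[8];               /* Room for the started processes */
char *args[] = { "worker", "-v", (char *)0 };
char *env[]  = { (char *)0 };
int nran;

/* Try all currently operational sites, in site-view order */
nran = isis_rexec(2, gaddr_p, sv_getview()->sv_slist,
                  "/usr/local/bin/worker", args, env,
                  "guest", "guestpw", alist);
if(nran < 2)
    printf("only %d copies could be started\n", nran);
\end{verbatim}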
The rexec-ed program runs under the same account that was
used to start ISIS running (perhaps the {\tt root} account!)
and in whatever directory ISIS happens to be running in.
This suggests that the facility should be used with care.
If authentication is not used, it runs applications under the
user-id ``nobody''.
\label{Note:rexec}
{\bf *Note:} On some Sun machines, if the isis {\tt rexec} utility is
running as root, and you issue an {\tt isis\_rexec} call with a username of
"nobody" (or a null username, as is done in the Twenty Questions demo
program) {\tt rexec} may fail with the following error:
\begin{verbatim}
ISIS: unable to exec : setgid(-2)Invalid argument
\end{verbatim}
{\tt Rexec} tries to start the program by calling setgid and setuid
giving as ids the values obtained from the passwd database (i.e.
/etc/passwd and/or yp). However the /etc/passwd supplied
by Sun gives -2 as the uid and gid for "nobody". These seem to be {\em illegal}
when supplied to setgid in this way.
The workarounds include:
\begin{enumerate}
\item[1.] Run rexec as something other than "root", e.g. some special "isis"
userid. This means that jobs it starts up will run under that userid.
\item[2.] "Fix" the /etc/passwd entry for "nobody", giving user and group ids
that are reasonable, e.g. 999.
\end{enumerate}
All isis\_rexec actions are logged in a file called {\tt rexec.log}
which can be checked if problems are suspected.
Rexec checks the usual UNIX paths to find programs to run.
The\index{recovery manager} recovery manager provides a facility that automatically restarts an ISIS
application after a crash. For this purpose the recovery manager keeps a
database that lists all application programs to be restarted. Every entry in
the restart database is identified by a unique key, which the user supplies.
The database may be updated from within a program by calling the following
routine:\index{\tt rmgr\_update}
\begin{verbatim}
rmgr_update(key, program, argv, envp);
char *key;
char *program;
char *argv[], *envp[];
\end{verbatim}
A {\tt key} may be an arbitrary string of up to {\tt RMLEN} characters.
Program is a program name, and argv and envp are vectors
of argument and environment strings as in {\tt execve}.\footnote{Due
to the current format of the restart database argument and environment
values may not contain curly braces or commas.
Quote characters are not allowed in the key.
These restrictions will be removed in a future release.}
{\tt Rmgr\_update} searches the database for an existing entry with the
given key.
If such an entry is found, program, argv, and envp will be replaced
by the new values; otherwise a new entry will be created.
{\tt Rmgr\_update} may be used to delete an existing entry by specifying
program as a null pointer.
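For example (the key, program path, and arguments shown here are purely
illustrative), an entry could be created and later deleted as follows:
\begin{verbatim}
char *argv[] = { "myserver", "-p", "1602", (char *)0 };
char *envp[] = { (char *)0 };

/* Create (or replace) the entry identified by "myserver" */
rmgr_update("myserver", "/usr/local/bin/myserver", argv, envp);
...
/* Passing a null program pointer deletes the entry */
rmgr_update("myserver", (char *)0, argv, envp);
\end{verbatim}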
Instead of updating the restart database from within a program you
may use the command {\tt rmgr\_cmd}.\index{\tt rmgr\_cmd}
It has the following syntax:
\begin{verbatim}
rmgr_cmd site_no key
rmgr_cmd site_no [-E] key program arg0 arg1 ...
\end{verbatim}
The first form deletes an existing entry from the database;
the second form creates a new entry or updates an existing one.
{\tt Site\_no} is the ISIS site number on which the database is updated.
By default {\tt rmgr\_cmd} generates entries with an empty environment
list ({\tt envp}).
If the {\tt -E} option is used {\tt rmgr\_cmd} will copy its own environment
to the database entry.\footnote{It is currently not possible to specify
environment values explicitly; this will be changed shortly.}
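For instance (again with an illustrative site number, key, and program), the
following commands create or update an entry on site 3 and then remove it:
\begin{verbatim}
rmgr_cmd 3 myserver /usr/local/bin/myserver myserver -p 1602
rmgr_cmd 3 myserver
\end{verbatim}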
The recovery manager keeps watching all processes that it has started up.
Should any of them abort, say due to a software error, it will automatically
be restarted.
However, if the program keeps crashing shortly after being restarted,
the recovery manager will eventually give up and print a message on
the system console.
To install a program while the system is running it is necessary to update
the restart database, and the program must explicitly register
itself with the recovery manager when it starts up.\index{\tt rmgr\_register}
It does so by calling
\begin{verbatim}
rmgr_register(key);
char *key;
\end{verbatim}
where {\tt key} must refer to an existing entry in the restart database.
Conversely, a program that wants to exit without being restarted immediately,
may do so by calling\index{\tt rmgr\_unregister}
\begin{verbatim}
rmgr_unregister();
\end{verbatim}
before exiting.
It is no error to call {\tt rmgr\_register} if the process is already
registered. This may be used to change the restart database entry associated
with the process while it is running.
{\tt Rmgr\_register} does not check whether the given key exists in the restart
database.
The recovery manager will print a console message if it is unable to restart
a process because it cannot find the entry in the restart database.
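A minimal sketch of a program that cooperates with the recovery manager might
look like this; the key and port number are illustrative:
\begin{verbatim}
main()
{
    isis_init(1602);            /* or 0 if the port is in /etc/services */
    rmgr_register("myserver");  /* key should name a restart database entry */
    ...                         /* normal operation */
    rmgr_unregister();          /* planned shutdown: do not restart me */
    exit(0);
}
\end{verbatim}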
ISIS is easy to install provided that you follow the
material in this section step by step.
\subsection{Rolling in the source files}
ISIS is provided in source form and must be compiled for each
system on which you run it, and potentially for each CPU architecture.
You can get started with ISIS using a private copy in your home
directory, though, and then convince the local administrative hackers
to install the system properly once you have a sexy application up and running to
show them.
Start by rolling in the ISIS source files.
You will find that these contain two top-level directories: {\tt isisv2.1} and
{\tt meta1.2} (or whatever the current release level is).
Start by changing the symbolic link in ``meta1.2/isis'' to point to the
``isisv2.1'' directory.
Next you'll want to build ISIS, then META. Details on the
latter subsystem, which not all users will need to run, are found in
meta1.3/doc/fd.doc.
\subsection{Compiling ISIS}
Change directory to the directory isisv2.1/MACHINE where MACHINE is
the name of your type of system, or MACH if you are running MACH (note that
for MACH you may need several copies of the binaries if you have a heterogeneous
hardware environment).
Run ``make''. ISIS will build itself.
Then run ``make install''. ISIS will install binaries and
libraries in a public area (skip this step if just testing out
a patch that you don't want to install yet).
If you have never built ISIS for this type of machine then
chdir to isisv2.1 and first run "make MACHINE".
(Note that this will automatically run ``make'' followed by ``make install''.)
By default we use a fairly ``weak'' level of optimization,
leave debug information in the binaries, and use the CC compiler.
You are free to change these options. The -g option actually slows ISIS down
and gives bigger code, but it lets you run dbx and make some sense of
activity within the ISIS system.
The CC compiler is not as good as GCC on many systems, and you may wish to
change to GCC if you have it installed.
Optimization level 3 (-O3) gives a significant speedup on some
systems, but may also be buggy on some systems.
We recommend its use on the SUN4, where we have tested extensively and
seen no problems at all.
What ISIS builds are a set of binaries in MACHINE/bin and a
set of libraries in MACHINE/clib/libisis[12].a and MACHINE/mlib/libisism.a.
You can run these immediately from MACHINE/run\_demos, which is
what we recommend that you do to test things out.
You can also make symbolic links from your home directory to these
files and run them under your account.
Once your group begins using ISIS more heavily, you may wish to
install it in a public place; we recommend that this be named /usr/spool/isis,
but this name can still be a symbolic link or an NFS mount point.
\subsection{Installing binaries and libraries}
Some systems install binaries to /usr/local/bin and libraries to /usr/local/lib, and
put symbolic links to the ISIS include files in /usr/include.
This is optional and you can work with ISIS even if you don't take these
steps. If you do take them, you will still be better off installing
symbolic links than copying real files, since this lets you correct bugs
or install new versions of ISIS without worrying about tracking down all the
copies people may have made.
ISIS runs by ``default'' out of MACHINE/bin and MACHINE/lib, which
are the directories filled by the ``make install'' command.
We normally start ISIS from MACHINE/run\_isis.
When testing a change to ISIS you may not want to disrupt active
users. In this case, {\em don't} run ``make install'', but
rather run ``make'' and then test from MACHINE/run\_demos.
Once the system is healthy, you can run make install.
\subsection{Files ISIS needs to be able to run}
To actually run ISIS on a site, you will need to do the following:
\begin{enumerate}
\item [1]
Create a {\tt sites}
\index{files, sites} \index{configuration files used by ISIS}\index{setting up ISIS}
file, listing this site and others in the
network with which it may need to communicate.
Currently, the sites file should contain no more than 127 sites,
a restriction that will soon be eliminated.
There is a file named sites in MACHINE/run\_demos; a good start would be to
edit this to change the machine names to match your installation,
removing any entries for cornell machines.
Don't list machines that will only connect to ISIS via the {\tt isis\_remote} interface.
{\em Note: Users who need to emulate a network of multiple machines
by running {\tt isis} multiple times on a single machine should see
the section of Chapter 16 that discusses this topic.}
\item [2]
In the system ``services'' file, {\tt /etc/services}, list the various ISIS connection port
\index{{\bf /etc/services} file}
numbers.
This is an optional step; if you skip it you need to memorize one of the
numbers from the sites file, but are not prevented from running ISIS.
\item [3]
Create {\tt isis.rc}, a file that tells ISIS how to
\index{{\tt isis.rc} configuration file}
restart at this site by indicating the names and sequence in
which system utilities should be run.
The {\tt isis.rc} file in MACHINE/run\_demos is a reasonable one for starting
out, but you can tune things quite a bit if you like.
\item [4]
\index{{\tt isis} command}
Test-start the system by running the {\tt isis} command in MACHINE/run\_demos.
This is symbolically linked to the binary directory (e.g. MACHINE/bin/isis).
This command uses the {\tt sites} file to contact other sites at which
ISIS may already be running, and then initializes the ISIS subsystems
specified in the {\tt isis.rc} file.
A normal startup sequence looks like this:
\begin{verbatim}
honir% isis
Site 3 (thiazi.cs.cornell.edu): isis is restarting...
Is anyone there?
... found no operational sites, checking again just in case
Is anyone there?
site 3 (thiazi.cs.cornell.edu) doing a total restart
../bin/protos -d/usr/spool/isis/#.logdir
../bin/rexec 1602
../bin/news 1602
../bin/rmgr 1602
../bin/lmgr 1602
../bin/xmgr 1602
Site 3/0 is up!
site view has viewid 3/1
thiazi.cs.cornell.edu [site_no 3 site_incarn 0]
rm: /usr/spool/isis/3.logdir/logs/log_temp*: No such file or directory
Transaction manager: checking for termination of lmgr.
Log Manager: Startup completed, exiting normally
isis: Detected termination of <../bin/lmgr>
Transaction manager: lmgr has terminated, continuing.
\end{verbatim}
{\em Note: it is normal for the ISIS log manager, lmgr, to terminate
after scanning the logs for the recovering site.}
\end{enumerate}
If this doesn't work, some things to check are:
\begin{enumerate}
\item[1.] Have you set up the symbolic links correctly?
\item[2.] Do the binaries exist? Are they executable?
You probably will want to recompile them locally.
\item[3.] Did you compile for the correct machine type (MCHTYPE=SUN3,SUN4,VAX...)?
If you changed this, is it a type of machine that ISIS knows about?
(Check, e.g., {\tt cl\_tasks.c}.)
\item[4.] Is ISIS permitted to create the socket? Did you use a valid port number?
\item[5.] Is ISIS already running from a previous attempt (try {\tt ps ax} and look for {\tt isis} or {\tt protos})?
\item[6.] Did ISIS complain about something? Look at the log file output it
created in xxx.log (xxx will be the site number for this site).
\item[7.] Is your sites file formatted just like ours?
\item[8.] Is the directory in which you ran ISIS writable by isis?
\end{enumerate}
There are more suggestions on this below.
If all else fails, give us a call. We can probably figure things out.
\subsection{Compiling and linking your own code}
Once ISIS is set up, you may want to build applications of your own.
In general, such applications will include "isis.h" and hence
the compiler must be told where to find the ISIS include files
(isis.h pulls others in).
If these are in /usr/include or /usr/include/isis, the compiler should
find them automatically.
If not, specify -I on your CC command line.
Link your program with the ISIS libraries specified last on the command
line, in the following order: clib/libisis1.a, clib/libisis2.a, mlib/libisism.a.
The libisism.a library can also be used in stand-alone fashion.
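For example, assuming the include files and libraries are reachable at the
paths shown (adjust them to match your installation), a typical
compile-and-link line would be:
\begin{verbatim}
cc -I/usr/include/isis -o myapp myapp.c \
        clib/libisis1.a clib/libisis2.a mlib/libisism.a
\end{verbatim}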
To run your program, hard-wire it to call {\tt isis\_init} with the
port number in the middle position of the list of port numbers in the
sites file (1602 in the run\_demos case).
(Later, if you put this information into /etc/services, you can instead
pass a 0 to isis\_init.)
\subsection{Starting ISIS automatically on a machine}
ISIS can be set up in such a way as to restart automatically whenever your
machine is up.
This section concerns the details of doing this.
Unlike some UNIX utilities, the {\tt isis} command
should normally be run with {\tt \&} specified, to put it in the background.
It will print some things, hence standard and error
output should be redirected to the system console, {\tt /dev/console}.
If you want ISIS started automatically, it makes sense to put a line
to this effect into the file {\tt /etc/rc.local}.
\index{{\tt /etc/rc.local} configuration file}
{\em ISIS should be the last thing started} on most systems.
Because isis uses many other system utilities,
if they are not up, {\tt isis} will not start correctly.
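As an illustration (assuming the {\tt isis} command, or a symbolic link to it,
and the {\tt sites} and {\tt isis.rc} files all live in /usr/spool/isis), the
{\tt /etc/rc.local} entry might read:
\begin{verbatim}
# Start ISIS last, in the background, with output going to the console
(cd /usr/spool/isis; ./isis > /dev/console 2>&1 &)
\end{verbatim}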
Several options are useful when starting ISIS automatically.
If you run {\tt isis} with the argument {\tt -An},
\index{{\tt A} option to {\tt isis}}
the system will restart itself automatically n*60 seconds after detecting an ISIS system
crash (even when this did not entail a full site-crash).
This could result in thrashing, however, if the reason ISIS
crashed has not gone away.
The flag should certainly not be enabled unless your
configuration has run comfortably for several weeks.
In the meantime, you will need to start the system by hand if it goes down.
If the ISIS system goes down and the site stays up,
ISIS will not restart itself more than 5 times (regardless of the {\tt -A} flag).
{\em Why should ISIS go down?}\index{ISIS system shutdowns}\index{crashes}
ISIS will shut down a site that becomes unresponsive for 80 seconds
when nothing is going on, or 20 seconds when someone is trying to send it a message.
This keeps the rest of the system alive, but means that if a workstation
loses contact with the NFS server ISIS could easily shut down and have to
be restarted.
You can figure out what caused ISIS to restart by looking at its log (see
the section on interpreting protocols process dumps).
\subsection{Making use of the ISIS\_AUTOSTART option}
ISIS supports an automatic restart option under which, if the system is not
up when an application is started, an attempt will be made to restart it by
executing the shell script {\tt /usr/bin/startisis}.
You should install such a script if you plan to make use of this option.
\index{/usr/bin/startisis shell script}
\index{{\tt ISIS\_AUTOSTART} option to {\tt isis\_init\_l}}
\subsection{What's in the site file}
A sites file tells ISIS three things: what sites the system
is able to run on, what inter-net port numbers to use to
talk to ISIS at each of these sites, and what namespace ``scopes''
each site belongs to.
\index{scope, specification in {\tt sites} file}
Here's a typical one:
\begin{verbatim}
+ 1:1601,1602,1603 aditi.cs.cornell.edu bldg1,computer-science,aditi
+ 2:1601,1602,1603 hymir.cs.cornell.edu bldg1,computer-science,hymir
+ 3:1601,1602,1603 thiazi.cs.cornell.edu bldg2,computer-science,thiazi
+ 4:1601,1602,1603 sigyn.cs.cornell.edu bldg2,physics,sigyn
+ 5:1601,1602,1603 modi.cs.cornell.edu bldg2,physics,modi
\end{verbatim}
The plus sign just means that the line contains a valid
sites-file entry.
It is followed by the site-number (1-127), a colon,
3 internet port-numbers and the fully qualified site-name for each site.
Finally, a list of zero or more scope names are given for each site -- here, aditi and hymir are
in scopes ``bldg1'' and ``computer-science'', etc.
Each is also listed as being in a scope with its own name, so that a
search for, e.g., ``@aditi:name'' will work.
The scope information is used when a {\tt pg\_lookup} is done within
the ISIS system: the search will only occur within the scope specified
at the time of the lookup, and if no scope is given, the default is
to search within all the scopes to which the caller's site belongs.
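For example, with the sites file above a client could restrict a lookup to
the machines in scope ``bldg1'' by prefixing the group name with the scope.
The group name {\tt timeservice} here is purely illustrative:
\begin{verbatim}
address *gaddr_p;

gaddr_p = pg_lookup("@bldg1:timeservice");
\end{verbatim}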
The internet port
\index{port number}
numbers can be chosen fairly arbitrarily.
The first one is used when ISIS talks to ISIS at another site.
If desired, it can also be given as 0 and the actual
number listed as the {\tt isis/udp}
port in the system file {\tt /etc/services}
(e.g. the line labeled ``isis'', with information given as port-no/udp).
The second number tells clients how to contact ISIS at the site
where they run.
It is recommended that this number be listed in {\tt /etc/services}
in an entry identified as {\tt isis/tcp}.
The third number is used when ISIS starts up, to do a simulated broadcast to
other ISIS sites on the network.
Again, this can be given as 0, in which case it should
go in the services file as {\tt isis/bcast}.
For example, the services files could be edited to contain
the following 3 lines:
\begin{verbatim}
isis 1601/udp
isis 1602/tcp
isis 1603/bcast
\end{verbatim}
All sites normally use the same set of three port numbers.
Port numbers should be picked to not conflict with those used by
other applications.
Numbers in the 1600 range should be safe.
Normal clients connect to ISIS using the port number given in the
tcp entry (1602, above), while remote ISIS clients connect using the
bcast entry (1603 in this example).
The sites file should be located in the directory where isis is
to be started up, probably not the root directory because of the
log files isis will generate.
Normally, we use ``/usr/spool/isis'', but you can pick some
other location provided that you tell ISIS what directory you used ({\tt -d} option, below).
\index{ISIS startup directory}\index{directory where ISIS is started up}
Normally, all ISIS sites share identical site-files. If desired, you can place
the sites file in a single well-known place on a network file system and specify
its location to ISIS using the {\tt -S} option to the protocols process.
However, this would prevent ISIS from restarting when the NFS unit is down.
If you change a sites file when ISIS is already running, use the {\sc rescan}
\index{{\sc rescan} command in {\tt cmd} tool}
command in the {\tt cmd} tool to tell ISIS to reread the file.
You only need to issue the command from one site; all sites will
reread the file at the same time.
\subsection{What's in the isis.rc file}
The restart file, named {\tt isis.rc}, usually is in the same place as the sites file.
It lists the system services to start. Here is a typical one that might be used
from directory /usr/spool/isis.
It assumes that a directory /usr/spool/isis/bin
contains the various executables, and
arranges to create the files used by the log manager, etc, in the
directory /usr/spool/isis/\#.logdir, where `\#' will be replaced
at runtime by the appropriate site-number.
\begin{verbatim}
;; this is a typical comment
bin/protos -a -d/usr/spool/isis/#.logdir
bin/rexec
bin/news
bin/rmgr
bin/lmgr
bin/xmgr
\end{verbatim}
This tells the system to first start the ISIS protocols process,
and then once it comes up to run the remote executive
facility, the news program, the log manager cleanup program (which
will exit shortly after restarting),
the recovery manager and the transaction manager.
The first argument gives a path name to use and the second
shows how the program should be listed if people run the
UNIX {\tt ps} command.
The {\tt -a} flag tells the protocols process to {\em append} to its
system log; by default it wipes the log clean each time it is started.
The log is called $n$.log for site $n$, and is useful for understanding
why the system crashed. However, logs can get big over time, which is
why the {\tt -a} flag is not the default.
You may choose to omit services, for example the {\tt rmgr} if you don't want to
allow automatic restart of ISIS applications, the {\tt rexec} service if you
don't want to allow remote executions through the {\tt isis\_rexec} system call,
or {\tt xmgr} if you don't use ISIS transactions.
{\tt protos} must be run first and {\tt lmgr} must be run before {\tt rmgr}
and {\tt xmgr};
order is otherwise not important. Most user-defined ``services'' are added
using {\tt rmgr\_cmd}, not by modifying the {\tt isis.rc} file.
You may include comments in the {\tt isis.rc} file
by putting them on a line by themselves that starts with a left-justified
pound-sign. A command line may not contain embedded comments, however.
Be aware that the sequence in which the commands in isis.rc appear
does not necessarily correspond to any guarantees about the order
in which these commands will be executed. We plan to address the
issue of ``sequenced'' restarts in a future version of ISIS, but for now
the user must be cautious to avoid race conditions in which a client of
a service restarts before the service itself restarts.
\subsection{Telling ISIS where it will be running}
One common option to the ISIS protocols process is ``{\tt -d} directory'', which
tells ISIS where to create its files when it
is run in a directory other than ``/usr/spool/isis''
(the default, which you can also specify explicitly).
If you start ISIS up in a place other than ``/usr/spool/isis'', and you
neglect to specify this option in the isis.rc file line for protos,
you will be unable to run the recovery manager, the transaction manager
and logging services.
Instead, each of these services will print error messages on the system console
of your machine. The directory is available from C programs
in a global variable called {\tt isis\_dir}.
It is a good idea to include the character `\#' somewhere in the log
directory pathname so that if two ISIS systems run on a single
file server, they still obtain different directories.
Thus, we tend to use a directory name like ``/usr/spool/isis/\#.logdir''.
One warning about the directory: ISIS needs to be able to create
files in this place, so make sure it is writable and has adequate room
for (a small amount of) ISIS log files and {\tt rmgr} and {\tt news} status files.
\subsection{Telling ISIS how fast to detect failures}
You can force ISIS to detect site failures more quickly than the default,
which is really very slow (perhaps 2 minutes or longer).
The default works well for systems that will run unsupervised and control
something critical; it gives ``adequate'' responsiveness without
spurious failure detections every time an NFS server misbehaves.
A faster timeout can be specified using the option {\tt -f}$n$ to the
{\tt bin/protos} command. The value of $n$ (60 corresponds to the default)
is a parameter to the ISIS failure detection facility.
Basically, failures will be noticed within about $2n$ seconds; {\tt -f10}
thus works well for demos of the system.
Unfortunately, in systems with NFS servers, YP servers, etc, we find that
ISIS kills sites off too quickly if we run with the failure detector
in a ``sensitive'' mode, such as one obtains from -f10 or small values.
We do not recommend going much below 10 or above 90 (which is even slower
than the default).
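In terms of the {\tt isis.rc} file shown earlier, asking for roughly 20-second
failure detection would mean changing the {\tt protos} line to something like
the following (the {\tt -f10} value is illustrative):
\begin{verbatim}
bin/protos -a -f10 -d/usr/spool/isis/#.logdir
\end{verbatim}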
\subsection{What's in the source directories}
The subdirectories are: \index{source directories}
\begin{description}
\item[clib]
The client runtime support and toolkit routines.
These are compiled into two libraries, libisis1.a and libisis2.a,
passed through {\tt ranlib} (on 4.3bsd systems this speeds up
later linking; other systems can skip this phase), and then
stored in an easily accessible place.
It is important that libisis1 be scanned before libisis2 and libisism scanned
last.
You may wish to create a symbolic link from /usr/lib/libisis1.a to libisis1.a, etc.
Users can then link by saying -lisis1 -lisis2 -lisism on the command line
(a sample link line is shown after this list).
\item[mlib]
The message utilities.
You may wish to create a symbolic link from /usr/lib/libisism.a to libisism.a.
\item[protos]
The protocols process, on which all else depends.
This is compiled into a separate program, which the {\tt isis} command
actually executes.
\item[util]
A variety of utilities, including the interactive command tool ({\tt cmd}),
the {\tt isis} program itself (which just starts the system up),
the news program, the recovery manager, and the program that
implements {\tt rexec}.
\item[demos]
A few demonstration programs, with descriptions of what they do.
\item[include]
Symbolic links to the real source for {\tt .h} files.
For example, {\tt include/isis.h} is linked to {\tt clib/isis.h},
where the real source for this file resides.
No source files are actually stored in the ISIS include directory.
\end{description}
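For example, assuming the symbolic links described above have been created, an
application might be linked roughly as follows; the program name is
hypothetical, and on SUN systems the {\tt -llwp} library may also be needed:
\begin{verbatim}
cc -o timecard timecard.c -lisis1 -lisis2 -lisism -llwp
\end{verbatim}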
There are also subdirectories for the various object files
corresponding to these source files, one for each type of machine
you support.
Thus, SUN/clib/... are object files, for the SUN, created from the
clib source files.
SUN/clib/libisis1.a is a SUN version of the client toolkit library.
{\em Comment:} You can avoid the business of having three libraries on
non-4.xbsd UNIX systems.
Just concatenate the .o files into one library, called isis.a, in the order
they are currently put into libisis1.a, libisis2.a and libisism.a.
Then you can store this in /usr/lib/libisis.a and people can link with -lisis.
However, order matters. If your system insists that you run ranlib on this library,
we recommend that you stick to three libraries.
If you DO run ranlib, it will (correctly) complain about multiple
definitions of some symbols.
This is because ISIS tries to keep the size of client programs down by
not including everything into every client. ISIS includes some dummy
entries which are supposed to resolve otherwise unresolved entry points
and keep the linker happy.
Ranlib, however, seems not to understand this approach, forcing us to split the
libraries up.
Some research groups need the ability to change ISIS and recompile
the modified version. To facilitate this, the system has a notion
of the {\em installed} versions of the binaries and libraries (located
in directories MACHINE/bin and MACHINE/lib, respectively) and the
development versions (located in the object directories as described
above). The system makefile permits you to issue a ``make install''
to transfer the object-directory versions of ISIS code to the
installed version.
We recommend that you run out of the installed versions so that
operations like installing patches and rebuilding the system will
not disrupt normal users.
\subsection{Some common problems with ISIS}
Some problems are fairly common.
If ISIS crashes or you shut it down
and then you restart ISIS too quickly on a 4.3bsd UNIX system,
it may be unable to reacquire the ports needed for
execution.
You would see an output like this on the console:\index{UDP port number already in use}
\index{connection failures}\index{ISIS not operational at this site}
\begin{verbatim}
thiazi% isis
Site 3 (thiazi.cs.cornell.edu): isis is restarting...
Is anyone there?
... found no operational sites, checking again just in case
Is anyone there?
site 3 (thiazi.cs.cornell.edu) doing a total restart
../bin/protos
cannot bind to UDP port 1601: is ISIS already running at this site?
thiazi (3/128): -- panic --
client_init
connect: Connection refused
\end{verbatim}
What this shows is that ISIS restarted but when protos
first began to initialize itself it was unable to
bind to the network address it was supposed to use.
Notice that the system has not yet figured out what
incarnation to use and is temporarily using incarnation 128.
This error can also occur if you run ISIS twice at
one time, in which case it can show up in an even more
messy way. The basic error
message is always about a {\tt bind} failing, however.
You might also get the error message
\begin{verbatim}
Can't create query reception socket!
\end{verbatim}
which indicates that a previous attempt to start
ISIS at this site died so miserably that it left
chunks of the system around---in this case, a
process that will be called {\tt isis}, but is really
just monitoring the site.
You should kill the process by hand and try again.
A more successful restart looks like this:
\begin{verbatim}
thiazi% isis &
[1] 3914
thiazi% Site 3 (thiazi.cs.cornell.edu): isis is restarting...
Is anyone there?
site 3 (thiazi.cs.cornell.edu) doing a partial restart, coord is 5/0
../bin/protos -d/usr/spool/isis
../bin/rexec 1602
../bin/news 1602
../bin/rmgr 1602
../bin/lmgr 1602
../bin/xmgr 1602
Site 3/0 is up!
site view has viewid 5/2
modi.cs.cornell.edu [site_no 5 site_incarn 0]
thiazi.cs.cornell.edu [site_no 3 site_incarn 0]
Transaction manager: checking for termination of lmgr.
Log Manager: Startup completed, exiting normally
isis: Detected termination of <../bin/lmgr>
Transaction manager: lmgr has terminated, continuing.
Log Manager: Startup Completed
\end{verbatim}
Here, you can see that everything is happy;
the command tool could be used to query the state
of the system to confirm this.
In some situations there may also be a message from {\tt /bin/rm} to the
effect that a directory is empty:
\begin{verbatim}
rm: /usr/spool/isis/3.logdir/logs/log_temp*: No such file or directory
\end{verbatim}
The rmgr program will automatically restart any ``registered'' programs
that should run at this site, which will then rejoin
their process groups or reload their states from logs.
(Had the port numbers been set in the {\tt /etc/services} file, they
would not have been added to the command lines for the utility programs.)
Another common problem is that ISIS may start up on a machine but
always claim that no other ISIS sites are running.
This is typically due to errors in your /etc/hosts files.
You may need to speak to an administrator about such a problem; explain
that UDP packets are not getting from to
, or back.
Such problems are more common than you might expect.
Another such problem arises if the system is unable to determine the
name of the host on which it was started, or if the name that comes
back from the {\tt gethostname(3)} system call isn't actually the one listed
in your /etc/hosts file (or /yp/etc/hosts, if you run YP).
In these cases ISIS usually complains that it can't do the hostname
lookup or that the bind operation failed.
This may also explain problems where ISIS is running, but your program is
not able to connect to it. We find it easiest to use dbx to set a
breakpoint in the {\tt isis\_init} routine when problems like this occur, to
have a look at the arguments being passed around.
Bind operations also fail if you have the misfortune to pick UDP port numbers
that are already in use, or (in some buggy UNIX systems, notably Apollo),
if ISIS is shut down in certain ways that confuse the kernel into leaving
the port hung in the termination state.
The thing to keep in mind is that startup
problems are always due to configuration problems at your
local site or network---ISIS has been used at
more than 200 locations worldwide with relatively little trouble.
Contact us if you think you are encountering a problem you can't make sense of.
We can probably straighten things out in a few minutes and save you a lot
of wasted time.
\subsection{Shutting ISIS down}
A quick way to shut down ISIS at {\em all} sites where it is running is to
type {\tt isis -Z}.
\index{{\sc shutdown} command in {\tt cmd} tool}\index{{\tt Z} option to {\tt isis}}
The ISIS system will print a message about having been ``zapped'' and will
shut itself down.
To shut ISIS down at just one site, either kill it with the UNIX {\tt kill}
command or execute the ``shutdown'' command of the command tool.
It may take a minute or so before other sites notice
(but not more than 3 minutes).
\subsection{Summary of installation procedure}
\begin{enumerate}
\item Create the {\tt sites} file.
\item Create {\tt isis.rc}.
\item Edit /etc/services (optional until ISIS installed for public use).
\item Create /site/spool/isis as a symbolic link to SUN4/run\_isis.
\item ln -s /site/subsys/ISIS\_2.0.dir/isisv2.0/include /usr/include/isis
\item Edit FILES/SUN4.MAKEFILE (uncomment the MCHDEP line) to enable bypass communication mode.
\item make SUN4
\item In /site/spool/isis, try: ./isis to test startup
\item Add the following lines to /etc/rc.local:
\begin{verbatim}
if [ -f /site/spool/isis/bin/isis ]; then
        /site/spool/isis/bin/isis > /dev/console 2>&1 &
        (echo "starting ISIS system") > /dev/console
fi
\end{verbatim}
\item make clean (frees about 1.7 MB)
\item Create a file named /site/bin/startisis, which will be
automatically invoked by isis applications if isis is not running:
\begin{verbatim}
#! /bin/sh
#
/site/spool/isis/bin/isis > /dev/console 2>&1 &
if [ "$?" -eq 0 ]
then
(echo "Application initiated (re)start of ISIS system") > /dev/console
fi
#
# end of script
\end{verbatim}
Apply "chmod 755" to this file.
\item Create link:
\begin{verbatim}
ln -s /site/bin/startisis /usr/bin/startisis
\end{verbatim}
\item Modify MANPATH to include (at end):
\begin{verbatim}
/site/subsys/ISIS_2.0.dir/isisv2.0/man
\end{verbatim}
\end{enumerate}
An isis application can receive UNIX signals normally (not to be
confused with the internal signals within tasks), but
the signal handler should not call any isis procedures directly,
as this could mess up data structures that were being
changed just when the signal arrived.
If you might need to, e.g., fork off a task to handle a
signal, it is recommended that ISIS itself be
asked to do this by calling {\tt isis\_signal(signo,handler,arg)}\index{\tt isis\_signal}
prior to invoking {\tt isis\_mainloop}.
The handler will be forked off once for each time the signal is received.
To disable this behavior, call {\tt isis\_signal(signo,0,0)}.
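As an illustration, the following sketch asks ISIS to fork off a task whenever
the process receives {\tt SIGINT}. It assumes that the handler is declared
with {\tt isis\_task} like any other task, that the handler receives the
argument registered with it, and that passing 0 to {\tt isis\_init} works
because the port is listed in {\tt /etc/services}:
\begin{verbatim}
#include <signal.h>
#include "isis.h"

main()
{
    int mymain(), on_sigint();

    isis_init(0);
    isis_task(mymain, "mymain");
    isis_task(on_sigint, "on_sigint");
    /* Fork on_sigint(0) as a new task each time SIGINT arrives */
    isis_signal(SIGINT, on_sigint, 0);
    isis_mainloop(mymain);
}

mymain()
{
    /* Normal startup work goes here */
    isis_start_done();
}

on_sigint(arg)
int arg;
{
    printf("Interrupted; shutting down\n");
    exit(0);
}
\end{verbatim}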
If you find it necessary to use UNIX IO signals to detect receipt of
messages of interest to ISIS, the two file descriptors on which
signals should be enabled are {\tt isis\_socket} and {\tt intercl\_socket}.
When the signal of interest might arrive while your program
is in a long compute loop, it is necessary to periodically pause to
permit ISIS to process pending signals and messages.
This code fragment illustrates the recommended solution:
\begin{verbatim}
/* Process a request, checking periodically lest it be canceled */
long_loop(mp)
message *mp;
{
/* Run some nasty long computation */
while(not-yet-done)
{
.... compute for a while ....
isis_accept_events(ISIS_ASYNC);
}
}
\end{verbatim}
The system routine {\tt isis\_accept\_events} will check for
\index{\tt isis\_accept\_events} pending messages to this process and pending signals, and
run the corresponding tasks if any.
When the argument passed is given as {\tt ISIS\_ASYNC}, as above, it
returns promptly to the caller when this is done (or immediately
if there is no work to do).
If the argument is given as {\tt ISIS\_BLOCK} it will block until some work
has definitely been done.
To wait for pending work, but give up after a specified interval,
specify {\tt ISIS\_TIMEOUT} followed by an additional argument
giving the timeout interval as a pointer to a {\tt struct timeval}
(see SELECT(3)). \index{\tt ISIS\_ASYNC}\index{\tt ISIS\_TIMEOUT}
\index{\tt ISIS\_BLOCK}
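For example, the following fragment (to be placed inside a task such as
{\tt long\_loop} above) polls for pending work but resumes computing after at
most one second; {\tt struct timeval} comes from {\tt <sys/time.h>}:
\begin{verbatim}
struct timeval delay;

delay.tv_sec = 1;     /* wait at most one second for pending work */
delay.tv_usec = 0;
isis_accept_events(ISIS_TIMEOUT, &delay);
\end{verbatim}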
All the ISIS broadcast primitives
are oriented towards ``in-band'' delivery: they respect any of a number
of ordering properties, and hence the order in which a process
receives messages sent using them is very carefully enforced by the system.
A requirement in some systems is for a way to notify processes
as rapidly as possible when some exception condition occurs.
ISIS supports an out-of-band broadcast mechanism, based on
UNIX-style signals as opposed to out-of-band messages.
To send such a signal one invokes:\index{\tt pg\_signal}
\begin{verbatim}
pg_signal(gid, signo)
address gid;
\end{verbatim}
The sender must be a member of the group for this to be legal.
If so, the signal is delivered to all members of the group as
promptly as possible, and completely disregarding the ordering
properties of the previous section.
Out-of-band signals are intended mostly as a way to shut down an
application when serious problems have arisen.
There is perhaps no better way to learn about a new system than by actually
using it, so let us begin by using ISIS to develop a simple application.
A typical ISIS application has parts that run on several different machines
and can be expanded to include more machines or be removed from some
machines even as the application is running.
The reason for distributing the application over a number of machines may
be to share the work, to obtain faster response time, or to be able to
continue operation despite the failures of some of the machines.
The example below has been chosen with the idea of exposing you to
most of the basic features of ISIS rather than to illustrate
a real-life ISIS application.
Yet, you will see that ISIS makes it simple to write a program that is
distributed, dynamically expandable, and fault-tolerant.
We will consider a distributed ``time-card''
service\index{time-card service} for an organization
with several departments.
The organization hires a number of temporary workers, who may work in
several different departments in a given week, covering excess work
where and when required.
Each department separately records the number of hours each temporary
employee works in that department.
The object of the time-card service is to enable someone
to give the service the name of an
employee and obtain the number of hours that the employee worked in the
various departments the previous week.
(The employee will presumably be paid on this basis.)
The time-card service will have to search through the records of the
individual departments before giving its response.
If the organization has a number of workstations connected by a
network (and they have ISIS),
this service could be implemented to run on several of these
machines instead of on just one machine.
One advantage is that the records of different departments can be scanned
in parallel on several machines, instead of one after the other on a single
machine.
This means that queries will be answered sooner.
A second advantage is that
we can ensure that the service remains available even when some of
the machines are not operational because of a failure or because
they have been taken off-line for maintenance or for other reasons.
Let us assume that each department keeps its records in a file
called {\tt department1}, {\tt department2}, {\tt department3}, and so
on, and that each record is simply a line giving the employee's name
followed by the number of hours he or she worked.
We will also assume that these files are available on all the
machines on which the service runs.
Before we go any further with our example, we need to introduce some of the
ideas that are central to programming with ISIS.
\section*{Process groups and broadcasts}
One of the most basic mechanisms that ISIS provides is a means of grouping
processes together and naming them as a unit.
A process group\index{process groups} could contain just a single
member, but will often consist of a number of processes residing on
machines anywhere in the system.
The membership of a process group could change with time, as new processes
join the group or as existing members leave it, either out of
choice or because of a failure of some part of the system.
A process can be a member of more than one process group.
ISIS also provides a broadcast mechanism that enables you to send a message
from a process to a process group.
To do this, the sending process first asks ISIS to ``look up'' the name of
the process group and obtains an ``address''\index{addresses}.
It then performs a
broadcast\index{broadcasts, usage of}\footnote{The term ``broadcast'' is
used in many different ways; we use it to mean the sending of a message
from one process to others using the ISIS broadcast function call.}
giving this address, the message,
and other relevant information as arguments.
The effect of this is to send a copy of the message to each of the
current members of the process group.
All members will eventually receive the message, although they may receive
it at different times.
The broadcast mechanism also allows the recipient of a message to send a
reply\index{\tt reply, short form},
or to forward\index{\tt forward} it to some other process that will send a
reply.
A process broadcasting a message can indicate that it wants to wait for
a specific number of replies, or that it wants a reply from all the
recipients of the message.
The broadcast function call returns when the requested number
of replies have been received.
If the group is not large enough, or if so many recipients terminate
(possibly because of failures) that the required number of replies cannot
be collected, ISIS will collect as many replies as possible and notify the
sender of the shortfall.
This reply mechanism allows the ISIS broadcast facility to be used as a
generalized remote procedure call mechanism\index{remote procedure call}.
The most common reason for making a set of processes into a process group is to
be able to broadcast messages to the group as a whole, even when its
membership may be changing.
As a special case, the simplest way to send a message to an individual
process is to make it a member of a process group containing only itself, and
broadcast messages to this group.
If a process will receive messages both as an individual and as a member of
a process group, it can be made a member of two process groups. Process
groups are cheap in ISIS, so this should not pose performance problems.
There is another reason why you might want to make a set of processes into
a process group.
ISIS provides simple-to-use tools that permit
members of a process group to access
shared or replicated information, to perform certain forms of coordinated
distributed execution, and to tolerate and recover from failures, among other
things.
If a set of processes wishes to make use of these tools, they would
typically be made members of a process group.
These tools will be described in later sections.
Process groups provide a convenient way of giving an abstract
name\index{process groups, symbolic names} to the
service implemented by the members of the group.
Other processes interact with the service using the name of the group and
the broadcast facility.
They need not be aware of the actual membership of the group.
This means that the group implementing the service can grow, shrink, move
to different machines, or add new capabilities without any interruption to
the service, and with the users of the service being unaware of these
changes.
It is this feature that makes it possible to develop applications that are
modular, dynamically expandable, and tolerant of failures.
Figure~\ref{fig:pgroup} provides an illustration of this.
It shows a process $P$ communicating with a process group
that implements a print service.
\begin{figure}[p]
\makebox[334pt][l]{
%\hspace{-168pt}
\vbox to 556pt{
\vfill
\special{psfile=kb90-001.ps}
}
}
\caption{Process groups provide a unit of abstraction.\label{fig:pgroup}}
\end{figure}
The important thing to notice is that in all three cases $P$ addressed its
message to the group {\tt PrintService}.
ISIS keeps track of the group membership and delivers the message to the
current members.
$P$ thus thinks of the group as an abstract service implementing some
function (print documents) and does not care about how many members the
group has or where they happen to be currently located.
It should be noted that the ISIS namespace is actually structured into naming
``scopes'' in order to limit the cost associated with this addressing
mechanism.
However, one can use ISIS without worrying about this issue, and we will defer
discussion of it until later in the manual.
\section*{Tasks and entries}
Readers familiar with UNIX will know that a UNIX ``process'' has a private
address space within which a single thread of control lives.
Some newer operating systems like MACH have introduced a notion of
lightweight tasks\index{tasks} that coexist within a single process,
sharing the address space.
Although ISIS was built on top of UNIX, we needed a task mechanism to
implement the system.
Consequently, a
process in an ISIS application is internally structured into a number of tasks.
An ISIS task looks just like a C function, and shares the same address
space and global variables as all the other tasks and functions in the
process.
The difference is that a task (and not just the function called {\tt main})
can be invoked by the system and start executing in response to certain
events, the most common of which is message delivery.
A task that is started up in response to a message delivery is called an
``entry''.
A process can have many entries and each one is given a different ``entry
number''.
When a message is sent, it is addressed to a particular entry number.
On delivery, (a pointer to)
the message is passed as a parameter to the entry,
which typically reads the contents of the message and acts
accordingly.
Programming with tasks is not very different from programming with regular
C functions, except for three things.
One is that you may need to link to special libraries, such as ``-llwp'' under SUN
UNIX.
Another is that when a task makes certain ISIS system calls, it is possible for a new task
to be started up and begin executing before the system call returns.
The original task will later continue from where it left off.
\index{tasks, blocking}
\index{tasks, limits on stack size}
As an example, consider a task that made a system call to broadcast a
message and is now waiting for replies.
If a new message arrives before the
replies come, another task (an entry) may be started up to handle the
new message even though the first task has not terminated.
Normally this poses no problem, but if the second task changes the values
of global variables, then the first task has to be aware of the fact that
their values might change between when it performs the broadcast system
call and when the system call returns.
System calls that allow other tasks to be started up before they return are
called
``blocking'' system calls because they ``block'' the execution of the task
that performs the call.
This documentation will indicate which system calls may block, and under
what conditions.
The other thing to keep in mind when programming with tasks is that they
may have a stack limit of 32k bytes\index{tasks, stack limit in}.
We say ``may'' because this limit does not apply on all systems (c.f. MACH)
and because it can be over-ridden if necessary.
A 32kb stack is sufficient for most purposes, but extremely deep nesting of
function calls (as might happen with recursive calls) or allocation of large
arrays or data structures on the stack may lead to the
limit being exceeded. ISIS will usually, but not always, detect this.
You can increase (or decrease) this limit with a one-time declaration, but
be aware that {\em all} tasks will have this new stack size, and if it is
made too large you may have problems with memory allocation.
There are various ways to get around this, including setting a per-task
limit or calling a single
subroutine without any stack size limit,
provided that the subroutine will not invoke any ISIS functions before it
returns.
This is convenient when, for example, an ISIS task must call a large piece of
software over which the ISIS programmer has no control, and which might not
respect the ISIS conventions.
We'll say more about this later.
Considerable thought has been given to the problem of porting ``old code''
to run under ISIS, especially in the case where the old code was not task-oriented.
By taking appropriate care, one can port ``old'' programs to ISIS in such a way that
only new code (added in conjunction with the port) is subject to any stack limit at all.
Moreover, under systems that already support a lightweight task facility, such as
SUNOS (LWP) or MACH (Cthreads), ISIS allows you to combine the ``native'' facilities
with the ISIS facilities in a completely unrestricted manner.
See Appendix F for a detailed discussion of both of these subjects.
\section*{Monitoring events}
A process can instruct ISIS to notify it when certain types of
events\index{\em monitoring} occur.
It does so by giving ISIS the name of a task to invoke (and an argument to
pass to the task) when that type of event takes place.
Among the types of events that can be monitored are process group
membership changes, process termination, and site failures.
This facility may be used to reapportion the work load
when new members join or leave a process group, or to take over work from a
process or site that fails, among other things.
\section*{Example: the time-card service}
From the discussion above, it follows that the time-card service should be
implemented as a process group consisting of a
number of processes running on different machines.
We will call the group {\tt timeservice}.
The reason for putting the processes in one process group is to be able to
use the broadcast mechanism to send queries to the group as a whole.
All the members of the time-card service execute the same program (which we
will call the service program).
Another program (the query program) is used to query the service.
We will develop both programs side by side, as is typical when programming
with ISIS.
Part of the code for the service program is shown below.
The value passed to {\tt isis\_init} is the port number\index{port number}
used by ISIS to talk to applications.
You will have to ask your system administrator for this.
If the port-number is given as 0, then ISIS will first check for
an environment variable ISISPORT\index{{\tt ISISPORT}},
and will use this number if found.
If not, ISIS will look in the {\tt /etc/services}
file\index{{\tt /etc/services} file}; thus, if the port is listed there,
you may give the value $0$ and ISIS will look it up for you.
You may need to consult with the person who installed ISIS on your
system if you try using port number 0 and your program won't start up.
ISIS V2.0 and beyond includes automated restart procedures that start ISIS
on a site when the first attempt is made to run an ISIS application there.
This is done using the interface {\tt isis\_init\_l}, as described
in Sec. 2.9. \index{{\tt isis\_init\_l}}
To connect to an ISIS running on a different machine, use {\tt isis\_remote},
as described in Sec. 2.10. \index{{\tt isis\_remote}}
A call to {\tt isis\_probe(freq,wait)} is used to tell ISIS to begin watching
the client.
ISIS will probe it once every {\tt freq} seconds and
kill the process if no response is received after {\tt wait} seconds.
By default, ISIS will not probe local clients and will probe
remote clients every 60 seconds, killing them if there is no response
within a further 60 seconds.\index{\tt isis\_probe}
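For example, a client that wants to be probed every 30 seconds, and killed if
it then fails to respond for a further 60 seconds, could call (the values are
illustrative):
\begin{verbatim}
isis_probe(30, 60);
\end{verbatim}
With these preliminaries out of the way, here is the first part of the service
program.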
\begin{verbatim}
#include "isis.h"
#define QUERY_ENTRY 1
address *gaddr_p;
int my_index;
int my_dept;
main()
{
int service_maintask();
int group_change();
int receive_query();
isis_init ();
/* Declare tasks and entry points */
isis_task (service_maintask, "service_maintask");
isis_task (group_change, "group_change");
isis_entry (QUERY_ENTRY, receive_query,
"receive_query");
isis_mainloop (service_maintask);
}
service_maintask()
{
int group_change();
/* Join the process group and monitor */
/* membership changes */
gaddr_p = pg_join ("timeservice",
PG_MONITOR, group_change, 0,
0);
isis_start_done();
}
group_change (gview_p, arg)
groupview *gview_p;
int arg;
{
int i;
/* Compute a unique index for this member */
i = 0;
while (!addr_ismine (&gview_p->gv_members[i]))
i++;
my_index = i + 1;
}
\end{verbatim}
In an ISIS application, the function
{\tt main}\index{main task}
usually just reads in the
command line arguments (this example has none), initializes ISIS,
declares tasks and entries, and sets off the main loop.
The argument to
{\tt isis\_mainloop}\index{\tt isis\_mainloop} is the first task to be run.
The first thing that {\tt service\_maintask} does is to join the process group
{\tt timeservice} and set up a monitor for group membership changes.
The first argument to
{\tt pg\_join}\index{\tt pg\_join} is the name of the group, and the last
argument is always a $0$.
In between these two arguments,
you may specify a number of optional keywords and the arguments
corresponding to those keywords.
Here, the keyword {\tt PG\_MONITOR} specifies that
the task {\tt group\_change} is to be called with the new group
view and the given argument (in this case 0) whenever the group membership
changes.
The function {\tt pg\_join} returns the address of the group, which is
stored in the global variable {\tt gaddr\_p}.
When the main task is begun, ISIS inhibits the delivery of new requests from
other processes.
This ensures that you can do all the necessary initialization
before being asked to respond to other events like incoming messages.
The call {\tt isis\_start\_done}\index{\tt isis\_start\_done}
tells the ISIS system that the startup
sequence is completed.
This means that new tasks may be started up at the next blocking system
call or after the main task terminates.
ISIS automatically invokes {\tt isis\_start\_done} when the main task
terminates (the call is hence unnecessary here), but if the main task
remains in a loop and you forget to call {\tt isis\_start\_done}, your
application will simply execute the main task and do nothing else.
Notice that if you want to terminate the execution of an ISIS process,
you must call {\tt exit}
explicitly\index{{\tt exit} system call in ISIS clients}.
It is important to realize that an ISIS process can be active even if there
are no active
tasks within it (e.g. the main task has finished, and no new tasks have been
started).
In fact, ISIS is designed under the assumption that tasks will start up,
do the work they are supposed to do, and then terminate (by returning). This
applies to the main task as well as any others.
A process with no active tasks in it
is in fact waiting in {\tt isis\_mainloop}\index{\tt isis\_mainloop} for
work to do.
Later in the manual we describe
a way to obtain a printable dump of the internal state of
a process that includes a list of all active tasks within it.
Look now at the routine {\tt group\_change}.
This routine is called in each member of the group (recall that they are
all executing the same piece of code) whenever the
group membership changes.
Routines that monitor group memberships are always called with a pointer to
the ``group view'' structure as their first argument.
(The second argument is a value supplied by the user when the monitor is
set up).
The {\tt groupview} \index{{\tt groupview} data structure}
structure contains information about the process group
(Section 2.5).
In particular, it contains a list of the addresses of all the current
members {\tt gv\_members}.
The group view structure always orders this list according to the ``rank''
of the members.
The oldest member in the group has rank $0$, the second oldest rank $1$,
and so on.
So the rank can be used to give each member a unique index that
distinguishes it from the other members.
In the example,
{\tt my\_index} contains the value of the rank $+ 1$.
(Another way to obtain the rank is to call
{\tt pg\_rank(gaddr, paddr)}\index{\tt pg\_rank},
which returns the rank of process {\tt paddr} in the group {\tt gaddr}.)
The task {\tt group\_change} is invoked in a process when
the process joins the group for the first time (this, too, is a
membership change), so when {\tt isis\_start\_done} is called
{\tt my\_index} will have a defined value.
This index will change each time a member joins or leaves the
group.
If you actually type this example in, you may wonder how to
print things like ISIS address data structures in a human-readable
format.
ISIS has functions for this purpose. For example, {\tt paddr}
will print a single address (given a pointer to it), and
{\tt paddrs} will print all the addresses in a list of addresses,
such as the one {\tt msg\_getdests} (get destinations to which
a message was sent) returns.
The fields printed include the site-id and incarnation number
where the process is running and the UNIX process-id of the process.
\index{printing contents of a groupview structure}
For example, to print the contents of a groupview {\em gip}:
\begin{verbatim}
print("Group <%s>: ", gip->gi_name);
paddr(gip->gi_addr);
print(" %d members, viewid %d.%d\n",
gip->gi_nmemb, VMM(gip->gi_viewid));
print("Members = [");
paddrs(gip->gi_members);
print("]\n");
\end{verbatim}
Here, the macro {\tt VMM} is used to take the unsigned long integer
{\tt viewid} apart into its {\em major} and {\em minor} numbers.
Major numbers change only when processes join and leave a group; the
minor viewid number also increments when certain types of
(very infrequent) broadcasts are received by the group members.
To receive incoming messages we must define an
entry.
The {\tt isis\_entry}\index{\tt isis\_entry} statement declares such an
entry, giving it the
entry number\index{{\em entry} number, defined} {\tt QUERY\_ENTRY}.
We now give the code for this entry.
We assume that an incoming query message contains a string giving the name
of an employee.
For now, we assume that there are at least as many members in the
process group as there are
departments, and that each member is responsible for searching through the
file for the department whose number is in {\tt my\_dept}.
Extra members have the value $0$ in {\tt my\_dept} and do nothing.
We will see later how the members decide which department they are
responsible for, and how they handle the case where there are fewer members
than departments.
\begin{verbatim}
receive_query (msg_p)
message *msg_p;
{
char query_name[MAX_NAMELEN];
int query_hours;
if (my_dept != 0)
{
/* Read employee name from message */
msg_get (msg_p, "%s", query_name);
/* Search through relevant file to find number */
/* of hours worked in my_dept and store it in */
/* query_hours */
/* Send reply message */
reply (msg_p, "%d%d", my_dept, query_hours);
}
else /* I am not responsible for any department */
reply (msg_p, "%d%d", 0, 0);
}
\end{verbatim}
An entry is called when a message addressed to its entry number arrives at
a process.
Its first argument is a pointer to the message.
This pointer may be used to read data out of the message using {\tt
msg\_get}, which has an interface similar to {\tt fscanf}.
In this case, it reads a string of characters out of the message and stores
it in {\tt query\_name}.
Let us shift gears for a minute and look at the query program.
It has a similar initialization sequence, and simply reads in a name from
the terminal, broadcasts a message to the {\tt timeservice} process group,
and prints out the replies.
This is what the code looks like.
\begin{verbatim}
#include "isis.h"
#define QUERY_ENTRY 1 /* From the service */
#define MAX_NAMELEN 64
#define NDEPTS 5
main()
{
int query_maintask();
isis_init ();
/* Declare tasks and entry points */
isis_task (query_maintask, "query_maintask");
isis_mainloop (query_maintask);
}
query_maintask()
{
address *gaddr_p;
char name[MAX_NAMELEN];
int dept[NDEPTS], hours[NDEPTS];
int i, rval;
/* Find address of timeservice process group */
gaddr_p = pg_lookup ("timeservice");
if (addr_isnull(gaddr_p))
{
printf ("Sorry! the service is not available\n");
exit();
}
isis_start_done();
/* Loop asking queries */
printf ("Enter employee name (^D to quit): ");
while (scanf ("%s", name) == 1)
{
/* Broadcast a message containing the name and */
/* collect replies */
do
{
rval = bcast (gaddr_p, QUERY_ENTRY, "%s", name, ALL,
"%d%d", dept, hours);
/* Exit on error */
if (rval <= 0)
{
isis_perror("Sorry! bcast error");
exit();
}
}
while (isis_nreplies != isis_nsent);
/* Print out time card */
printf ("Time card for %s:\n", name);
printf (" Dept. Hours\n");
for (i = 0; i < isis_nreplies; i++)
if(dept[i])
printf ("%8d%8d\n", dept[i], hours[i]);
/* Read in next name */
printf ("\nEnter employee name (^D to quit): ");
}
/* Quit by explicitly terminating this process */
exit(0);
}
\end{verbatim}
Notice the use of {\tt pg\_lookup}\index{\tt pg\_lookup} to obtain the
address of a process group.
The most significant part of the code, of course, is the call to {\tt
bcast}\index{\tt bcast} to send a message and collect the replies.
As you can see, the call first specifies the address and the entry number.
This is followed by a description of the data to be put into the outgoing
message (in a form similar to {\tt fprintf}).
Next comes the number of replies wanted.
The constant {\tt ALL} specifies that a reply is wanted from all the
processes to which a copy of this message was sent.
This is followed by a description of where to put the data that is
read out of the reply messages (in a form similar to {\tt fscanf}).
Unlike {\tt fscanf}, though,
each item is a pointer to an {\em array} of the given
type, because there could be more than one reply message.
The data from each reply goes into one element of each of the arrays.
Compare the call to {\tt bcast} in the query program with the calls to
{\tt msg\_get} and {\tt reply} in the service program; it's easy to see how
they match up.
When a call to {\tt bcast} returns, the global variable {\tt
isis\_nsent}\index{\tt isis\_nsent}
contains the number of processes that were sent a copy of the message and
{\tt isis\_nreplies}\index{\tt isis\_nreplies} contains the number of
replies collected (not counting ``null replies'' sent using
the special ISIS function {\tt nullreply}).
({\tt isis\_nreplies} is
also the return value of {\tt bcast}, unless an error
occurs, in which case an error code is returned.)
In our example, these two values would normally be equal.
However, if a process that was sent a message terminates before replying
(possibly because of a failure), ISIS will detect this and the call to {\tt
bcast} will return without collecting a reply from this member.
A process that terminates is automatically dropped out of any process group it
belonged to, so our query program reissues the broadcast if this
happens.
This next time around the message will be sent to the new membership of the
group and unless more processes terminate, a reply will be received
from all of them.
Now we shall see how each member is assigned a department (we still assume
that there are at least as many members as there are departments).
Let {\tt NDEPTS} be the number of departments.
Earlier we showed how each member can
compute a unique index ({\tt my\_index}).
A simple rule would be to make the member with index $i$ responsible for
department $i$, for $i \leq {\tt NDEPTS}$.
A member with index $i > {\tt NDEPTS}$ does nothing unless an active
member drops out of the group (possibly because of a failure).
If the index of a previously inactive member now becomes less than or
equal to {\tt NDEPTS}, it will begin to take part in the search.
Such a process is called a ``standby''.
Standbys make an application tolerant of failures, and are used
often in ISIS.
This application will tolerate the failure of as many processes as there
are standbys.
The complete code for the service program is given below.
Note that whenever the group membership changes, each member recomputes {\tt
my\_index} and opens the relevant file {\tt department}$i$ (and closes any
previously opened one).
\begin{verbatim}
#include <stdio.h>
#include "isis.h"
#define QUERY_ENTRY 1
#define MAX_NAMELEN 64
#define NDEPTS 5
address *gaddr_p;
int my_index;
int my_dept = 0;
FILE *my_file;
main()
{
/* Same as before */
}
service_maintask()
{
/* Same as before */
}
group_change (gview_p, arg)
groupview *gview_p;
int arg;
{
char filename[16];
/* Compute a unique index for this member */
i = 0;
while (!addr_ismine (&gview_p->gv_members[i]))
i++;
my_index = i + 1;
/* Close previously open file, if any */
if (my_dept != 0)
fclose (my_file);
/* Reassign departments */
if (my_index <= NDEPTS)
{
my_dept = my_index;
/* Open relevant file */
sprintf (filename, "department%d", my_dept);
my_file = fopen (filename, "r");
if (my_file == NULL)
{
printf ("Could not open file %s\n", filename);
exit();
}
}
else
my_dept = 0;
}
receive_query (msg_p)
message *msg_p;
{
char query_name[MAX_NAMELEN], name[MAX_NAMELEN];
int query_hours, hours;
if (my_dept != 0)
{
/* Read employee name from message */
msg_get (msg_p, "%s", query_name);
/* Search through relevant file to find number */
/* of hours worked in my_dept */
query_hours = 0;
rewind(my_file);
while (fscanf (my_file, "%s %d", name,
&hours) == 2)
if (strcmp (query_name, name) == 0)
{
query_hours = hours;
break;
}
/* Send reply message */
reply (msg_p, "%d%d", my_dept, query_hours);
}
else /* I am not responsible for any department */
reply (msg_p, "%d%d", 0, 0); /* Reply indicating no data for any department */
}
\end{verbatim}
We now have a complete implementation for the time-card service.
At this point you can compile the two programs
separately and link each of them with the ISIS libraries {\tt libisis1.a},
{\tt libisis2.a}, and {\tt libisism.a} (in that order).
\index{libisis1.a}
\index{libisis2.a}
\index{libisism.a}
\index{linking a program to ISIS libraries}
(The code and some sample data files are provided in the {\tt demo}
directory. Check with your site administrator to find out where these
subroutine libraries are located on your machine. )
You can then start up as many instances of the service program as you wish,
perhaps on different machines, and as many instances of the query program
as you want, also on different machines if desired, and begin issuing
queries.
To test the fault tolerance of the program, you may wish to add a loop to
the query program, so that for each name you type in, it broadcasts the same
query over and over, say 50 times.
Then as the service is being queried repeatedly, you can add new members by
starting up new instances of the service program on any machine or simulate
a failure by killing existing members (using \verb|^C| or the {\tt kill}
command).
You will see that as long as there are at least {\tt NDEPTS} members, the
\marginpar{\em If you kill a machine, there may be a long delay before
ISIS notices, depending on how it was configured at your site.}
service will continue to operate correctly.
You can even turn off the power from a machine that has a member on it.
In this case, there may be a pause because ISIS will wait for an answer
from this machine for about 45 seconds before timing out and deciding that it
has failed (NB: this delay varies with the setting of the ``-f'' parameter to
the ISIS protos server; see Appendix A for details).
To make this even more interesting and fun, we provide the code for an
implementation that works even if there are fewer than {\tt NDEPTS}
members.
The changes are simple.
If the number of members drops below {\tt NDEPTS}, some members take care of
more than one department.
The variables {\tt my\_dept} and {\tt my\_file} become arrays, and the
replies carry arrays, too.
Notice the changes to {\tt bcast}, {\tt msg\_get}, and {\tt reply}
to handle arrays.
The details of the syntax are given in Section~\ref{fmt}.
Here is the service program.
\begin{verbatim}
#include <stdio.h>
#include "isis.h"
#define QUERY_ENTRY 1
#define MAX_NAMELEN 64
#define NDEPTS 5
address *gaddr_p;
int my_index;
int my_ndepts = 0;
int my_dept[NDEPTS];
FILE *my_file[NDEPTS];
main()
{
/* Same as before */
}
service_maintask()
{
/* Same as before */
}
group_change (gview_p, arg)
groupview *gview_p;
int arg;
{
char filename[16];
int n_members;
int i;
/* Compute a unique index for this member */
i = 0;
while (!addr_ismine (&gview_p->gv_members[i]))
i++;
my_index = i + 1;
/* Record number of members in group */
n_members = gview_p->gv_nmemb;
/* Close previously open files, if any */
for (i = 0; i < my_ndepts; i++)
fclose (my_file[i]);
/* Reassign departments */
my_ndepts = 0;
for (i = my_index; i <= NDEPTS; i += n_members)
{
my_dept[my_ndepts] = i;
/* Open relevant file */
sprintf (filename, "department%d", i);
my_file[my_ndepts] = fopen (filename, "r");
if (my_file[my_ndepts] == NULL)
{
printf ("Could not open file %s\n", filename);
exit();
}
my_ndepts++;
}
}
receive_query (msg_p)
message *msg_p;
{
char query_name[MAX_NAMELEN], name[MAX_NAMELEN];
int query_hours[NDEPTS], hours;
int i;
if (my_ndepts > 0)
{
/* Read employee name from message */
msg_get (msg_p, "%s", query_name);
/* Search through relevant files to find number */
/* of hours worked in my_dept[i] and store in */
/* query_hours[i] */
for (i = 0; i < my_ndepts; i++)
{
rewind(my_file[i]);
query_hours[i] = 0;
while (fscanf (my_file[i], "%s %d", name,
&hours) == 2)
if (strcmp (query_name, name) == 0)
{
query_hours[i] = hours;
break;
}
}
/* Send reply message */
reply (msg_p, "%D%D", my_dept, my_ndepts,
query_hours, my_ndepts);
}
else /* I am not responsible for any department */
reply (msg_p, "%d%d", 0, 0); /* Reply indicating no data for any department */
}
\end{verbatim}
And here is the corresponding query program.
\begin{verbatim}
#include "isis.h"
#define QUERY_ENTRY 1
#define MAX_NAMELEN 64
#define NDEPTS 5
main()
{
/* Same as before */
}
query_maintask()
{
address *gaddr_p;
char name[MAX_NAMELEN];
int dept[NDEPTS * NDEPTS], hours[NDEPTS * NDEPTS];
int i, j, k, rval;
int arraylen_1[NDEPTS], arraylen_2[NDEPTS];
/* Find address of timeservice process group */
gaddr_p = pg_lookup ("timeservice");
if (addr_isnull (gaddr_p))
{
printf ("Sorry! the service is not available\n");
exit();
}
isis_start_done();
/* Loop asking queries */
printf ("Enter employee name (^D to quit): ");
while (scanf ("%s", name) == 1)
{
/* Broadcast a message containing the name and collect */
/* replies */
do
{
rval = bcast (gaddr_p, QUERY_ENTRY, "%s", name,
ALL, "%D%D", dept, arraylen_1,
hours, arraylen_2);
/* Exit on error */
if (rval <= 0)
{
isis_perror("Sorry! bcast error");
exit();
}
}
while (isis_nreplies != isis_nsent);
/* Print out time card */
printf ("Time card for %s (based on %d replies):\n",
name, isis_nreplies);
printf (" Dept. Hours\n");
for (i = 0, k = 0; i < isis_nreplies; i++)
if(dept[i] == 0)
continue;
else for (j = 0; j < arraylen_1[i]; j++)
{
printf ("%8d%8d\n", dept[k], hours[k]);
k++;
}
/* Read in next name */
printf ("\nEnter employee name (^D to quit): ");
}
}
\end{verbatim}
\section*{Exercise}
As an exercise, you may wish to add an ``update'' entry to the service
program and write a corresponding update program.
The update program should read in new data for each department from the
terminal, and send this data in a message addressed to the update entry
number of the service.
The update entry should read the data out of the message and rewrite the
files {\tt department1}, {\tt department2}, \ldots .
If you have a shared file system, each member should rewrite only the files
corresponding to the departments it is responsible for; if you have
separate copies of the files at each member, each member should update all
the files.
An interesting observation is that because ISIS ensures that message
delivery events occur in the same order everywhere, you can continue to
send query messages to the service even as an update message is being sent.
It will never be the case that some members respond to a query based on
old information, while others respond based on updated information---an
important property for many applications.
\section*{But why does it work?}
It is possible that you don't believe that the program above will
work.\index{\em event ordering}
Here's what seems to be a counter-example.
Consider a query that is being sent to the service just about the same time
as one of its members fails.
Assume that the broadcast message reaches some members
before the failure is noticed there (i.e. before the routine {\tt
group\_change} is called there), while it reaches the other members after
{\tt group\_change} is called.
The first set of members respond based on the departments assigned to them
before the failure, while the second set respond based on the new
assignment of departments.
Clearly, it is now possible for two reply messages to contain information
about the same department, while the files for some other
departments may not be searched at all.
One of the main features of ISIS is that {\em such anomalous orderings do
not happen}.
The program above works correctly only because the notification of membership
changes and the delivery of broadcast messages occur in the same order at
all the members.
ISIS guarantees that all events (including broadcast message deliveries,
monitor notification, and a host of other functions to be discussed later)
occur in the same order at all processes.
This is true even for the notification of unpredictable events like process
or site failures.
It is this feature that makes programming with ISIS so simple.
In the absence of this ordering guarantee, every query in our example would
have to involve some kind of agreement protocol to ensure that the members
agreed on the current state of the group.
This not only muddles the application, but also leads to rather poor
performance.
ISIS insulates programmers from this level of complexity.
This allows them to
work in a simple and ordered environment and makes it possible to build
distributed applications that would otherwise be intractable.
We expand upon this concept below.
\section*{Ordering in ISIS}
To illustrate how the ordering of events works in ISIS, let us first look
at what an execution might look like if we didn't have any ordering
guarantees.
\begin{figure}[p]
\makebox[373pt][l]{
%\hspace{-185pt}
\vbox to 557pt{
\vfill
\special{psfile=kb90-002.ps}
}
}
\caption{An execution in a disordered world.\label{fig:disorder}}
\end{figure}
Figure~\ref{fig:disorder} shows three processes $A$, $B$, and $C$ as they
join and leave the process group {\tt Pgroup}, while two other
processes $P$ and $Q$ send messages to this group.
The figure also shows (in curly braces) the view each process has of the
current group membership.
Notice the confusion.
$A$ receives $m1$ before $m2$, while $B$ receives $m2$ before $m1$.
If $A$ and $B$ were maintaining copies of the same data structure, for
example, and $m1$ and $m2$ were requests to perform operations on this
data structure (e.g. add an item to a queue, or remove the first item on
the queue), then performing them in different orders could lead to the copies
of the data structure becoming inconsistent.
So additional communication is necessary before $A$ or $B$ can act on any
incoming request to avoid such inconsistencies.
Further in the execution we see that $A$ receives $m3$ when it thinks that
{\tt Pgroup} consists of just $A$ and $B$, while $B$ and $C$ receive
it when they think the group contains $A$, $B$ and $C$.
We have already seen in our time service example that handling a message
based on inconsistent views of the group membership could lead to incorrect
results.
Again, the only way to avoid this is to run some kind of agreement protocol
for every incoming request.
The figure also shows $m4$ being delivered to processes with inconsistent
group views, this time because of a failure.
How does ISIS help?
A programmer would like to think of actions like the delivery of a
broadcast message or the notification of a group membership change as a
single event, even though they consist of parts that
take place in more than one process and
possibly at different times.
For example, one would like to be able to write a program thinking,
``When the group receives the message broadcast by process $A$, do
something,'' or ``when the group is notified of the failure of process $B$,
do something else.''
One look at Figure~\ref{fig:disorder} should convince you that this kind of
thinking is not possible in a disordered environment.
The programmer is forced to consider the possible interleavings of the
various events and cannot think of distributed events
like message delivery or failure detection as single units.
A programmer using ISIS, on the other hand, is guaranteed that distributed
events like broadcast message deliveries, notifications of group membership
changes (even if they are due to failures), and many other kinds of events
will occur in exactly the same order in every process\index{virtual synchrony, global event ordering}.
In other words, interleavings like those in Figure~\ref{fig:disorder} will
simply never happen in ISIS.
\begin{figure}[p]
\makebox[373pt][l]{
\hspace{-176pt}
\vbox to 557pt{
\vfill
\special{psfile=kb90-003.ps}
}
}
\caption{An execution in the {\em ISIS} environment.\label{fig:order}}
\end{figure}
Figure~\ref{fig:order} shows an execution that {\em could} occur in ISIS
with the same set of events.
Notice how much simpler things become.
A programmer can work with the knowledge that each process has the same
view of the world when an event like a message delivery or membership
change occurs (because each process has seen the same preceding set of
events and in the same order).
Since the programmer knows the algorithm each process follows, he or she
can code each process to make unilateral decisions and know that
they will all make consistent decisions.
No special agreement protocols need be coded (ISIS does all this hard work
for you).
The result is a program that is simpler to code, easier to understand,
and quicker to debug.
One feature of Figure~\ref{fig:order} is worth elaborating.
Observe the delivery of $m3$.
At the time $P$ initiated the broadcast, process $C$ was not a member of
{\tt Pgroup}.
However, by the time delivery occurred, the members had been notified of
$C$'s join.
The question arises of whether $m3$ should be delivered to $C$ or
not.\index{\em group addressing}
$P$ broadcasts $m3$ to the group {\tt Pgroup}, and if it is treating
the group as an abstract entity, it should not be concerned with the actual
membership of the group.
On the other hand, if a member receives a broadcast message when it has a
certain view of the group membership, it is reasonable for it to expect
that the message was sent to all the other members in its view of the
group.
This is precisely what ISIS guarantees.
If a member of a process group receives a broadcast message addressed to
the group, then a copy of the message will also be sent to all other
members that it knows to be in the group at the time the message is
received (this membership may be different from when the send occurred).
Accordingly, copies of $m3$ are delivered to $A$, $B$ and $C$.
It follows from this that all the members will have exactly the same view
of the group membership when any particular broadcast message is received.
\section*{Virtual synchrony}
The discussion above should have given you an idea of\index{virtual synchrony, definition}
how the ordering of events in ISIS makes it easier to write distributed programs.
To enable you to use ISIS most effectively, however, we urge you to
make one more conceptual shift in the way you view ordered events.
\begin{figure}[p]
\makebox[424pt][l]{
\hspace{-146pt}
\vbox to 503pt{
\vfill
\special{psfile=kb90-004.ps}
}
}
\caption{A synchronous execution.\label{fig:vsync}}
\end{figure}
Compare Figure~\ref{fig:order} with Figure~\ref{fig:vsync}.
In Figure~\ref{fig:vsync}, each process sees the same set of events and in
the same order as in Figure~\ref{fig:order}.
In other words, unless a process actually looks at a clock and records
the time, {\em the execution in Figure~\ref{fig:vsync} is indistinguishable
from the one in Figure~\ref{fig:order}}, at least from the point of view of
any individual process.
The difference is purely conceptual and lies in the way you, as the
programmer, look at the execution.
Figure~\ref{fig:vsync} shows distributed events like message delivery
happening everywhere at the same time (i.e. in synchrony).
Of course, ISIS does not guarantee that distributed events will actually be
synchronized; it only guarantees that they
will be ordered as in Figure~\ref{fig:order}.
But as we earlier observed, a process cannot distinguish the ordered
execution (Figure~\ref{fig:order}) from the synchronous one
(Figure~\ref{fig:vsync}).
What this means is that a programmer can code each process {\em as if}
the execution will actually be synchronous.
Any actual execution in the ISIS environment will only be ordered,
not synchronized, but a process will not be able to tell the difference
(unless of course it records the clock time).
So any code written as if the environment were actually synchronous will
work correctly when run under ISIS.
This is why we call the ISIS execution environment {\em virtually
synchronous}.
Virtual synchrony enables a programmer to write code while thinking of
distributed events as occurring everywhere at the same time---precisely the
kind of thinking we said was impossible in a disordered environment.
To actually make distributed events happen simultaneously
would be wasteful and inefficient because this would mean giving
up the vast potential for concurrency normally
available in a distributed system.
Instead, ISIS enforces enough order that
the resulting code works correctly, while not sacrificing concurrency.
This idea goes beyond message delivery and the notification of
events.
Each of the ISIS tools is designed with virtual synchrony in mind.
Even though they are quite complex and concurrent internally and often
involve a number of rounds of communication,
they can all be used as if their actions happen instantaneously and
indivisibly at all the relevant processes.
The state transfer tool described below is one such example.
The result of virtual synchrony, then, is to remove from the programmer
much of the complexity that arises from distribution, concurrency and
fault-tolerance, making
it as easy to write a distributed program as it is to write one for a
single central machine.
\section*{State transfer}
Let us go back to our time-service example.\index{state transfer}
The rule we used to divide the work has one big disadvantage.
Each time the group membership changed, a member could become
responsible for a completely new set of departments.
It had to close all its files and open new ones.
In a real-life implementation, parts of the file (or all of it) would be
read into main memory and fast access structures constructed to search
through the data.
All this would have to be redone each time the membership changed.
Let us instead consider a rule for dividing the work that avoids unnecessary
reassignment.
For example, we could adopt the rule that
when a member leaves the group, its departments are assigned to
the member responsible for the fewest departments.
A member that joins the group takes over half the departments from
the member with the most departments.
To compute the new assignments under
this rule, it is not enough for a member to know
just its own assignment of departments.
Each member has to know the assignments of all the other members as
well.
The code below shows how this might be done.
\begin{verbatim}
#include <stdio.h>
#include "isis.h"
#define QUERY_ENTRY 1
#define MAX_NAMELEN 64
#define NDEPTS 5
#define MAX_MEMBERS 10
int n_assignments = 0;
struct
{
address his_addr;
int his_ndepts;
int his_dept[NDEPTS];
} assignment[MAX_MEMBERS];
address *gaddr_p;
int my_index;
int my_ndepts = 0;
int my_dept[NDEPTS];
FILE *my_file[NDEPTS];
main()
{
/* Same as before */
}
service_maintask()
{
/* Same as before */
}
group_change (gview_p, arg)
groupview *gview_p;
int arg;
{
char filename[16];
int n_members;
int i, small_i, failed_i, nsmall;
address small_addr, *failed_addr_p;
/* Compute a unique index for this member */
i = 0;
while (!addr_ismine (&gview_p->gv_members[i]))
i++;
my_index = i + 1;
/* Record number of members in group */
n_members = gview_p->gv_nmemb;
/* Reassign departments from failed members */
failed_addr_p = &gview_p->gv_departed[0];
while (!addr_isnull (failed_addr_p))
{
/* Find member with fewest departments */
nsmall = NDEPTS + 1;
for (i = 0; i < n_assignments; i++)
if (assignment[i].his_ndepts < nsmall)
{
small_i = i;
nsmall = assignment[i].his_ndepts;
small_addr = assignment[i].his_addr;
}
/* Transfer departments from failed member */
failed_i = 0;
while (!addr_isequal (&assignment[failed_i].his_addr,
failed_addr_p))
failed_i++;
for (i = 0; i < assignment[failed_i].his_ndepts; i++)
{
assignment[small_i].his_dept[
assignment[small_i].his_ndepts++] =
assignment[failed_i].his_dept[i];
if (addr_ismine (&small_addr))
{
my_dept[my_ndepts] =
assignment[failed_i].his_dept[i];
sprintf (filename, "department%d",
my_dept[my_ndepts]);
my_file[my_ndepts] = fopen (filename, "r");
my_ndepts++;
}
}
/* Remove failed member from assignment list */
for (i = failed_i; i < n_assignments - 1; i++)
assignment[i] = assignment[i + 1];
n_assignments--;
/* Repeat for next failed member */
failed_addr_p++;
}
/* Assign departments to new members */
/* This code is similar to the above and is left as */
/* an exercise. Note that it must also handle the */
/* case when the first process joins an empty group, */
/* that is, when n_assignments = 0 */
}
receive_query (msg_p)
message *msg_p;
{
/* Same as before */
}
\end{verbatim}
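One possible completion of the step left as an exercise inside
{\tt group\_change} is sketched below.
It is only a sketch: it assumes that the group view also carries a
null-terminated list of newly joined members in a field called
{\tt gv\_joined} here (check the {\tt groupview} definition on your system),
and it omits the bookkeeping on {\tt my\_dept} and {\tt my\_file} that must
accompany departments moving to or away from this member.
\begin{verbatim}
assign_to_new_members (gview_p)
    groupview *gview_p;
{
    address *joined_addr_p;
    int i, new_i, big_i;
    joined_addr_p = &gview_p->gv_joined[0];
    while (!addr_isnull (joined_addr_p))
    {
        /* Add the new member with an empty assignment */
        new_i = n_assignments++;
        assignment[new_i].his_addr = *joined_addr_p;
        assignment[new_i].his_ndepts = 0;
        if (new_i == 0)
        {
            /* First process in an empty group takes every department */
            for (i = 0; i < NDEPTS; i++)
                assignment[0].his_dept[assignment[0].his_ndepts++] = i + 1;
        }
        else
        {
            /* Find the member with the most departments ... */
            big_i = 0;
            for (i = 1; i < new_i; i++)
                if (assignment[i].his_ndepts > assignment[big_i].his_ndepts)
                    big_i = i;
            /* ... and let the new member take over (roughly) half of them */
            while (assignment[new_i].his_ndepts < assignment[big_i].his_ndepts)
                assignment[new_i].his_dept[assignment[new_i].his_ndepts++] =
                    assignment[big_i].his_dept[--assignment[big_i].his_ndepts];
        }
        joined_addr_p++;
    }
}
\end{verbatim}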
The integer {\tt n\_assignments} and the array {\tt assignment} describe the
way the work is divided among the members.
It is an example of ``configuration data,''\index{configuration of a process group} that
is data that describes the state of a process group as a whole.
One problem still remains.
How are these configuration data to be initialized?
When a new process joins the group, its configuration data structures must
contain the assignment that was in effect just before the join,
otherwise {\tt group\_change} will go berserk when it is called.
In other words, the state of the group at the time of a join must somehow
be transferred to the joining process.
{\tt pg\_join}\index{{\tt pg\_join}, state transfer option}
provides an option that permits state transfer.
When a process joins a process group, the system picks one of the
existing members, and calls a user-specified ``send routine'' in that
process to obtain the state to be transferred.
A send routine calls the function {\tt xfer\_out} one or more
times,\index{state transfer}
each time giving it a pointer to a piece of the state, its type, its
size (if it is an array), and a non-negative integer ``locator'' associated
with this piece of state.
Each call to {\tt xfer\_out} resembles a call to {\tt bcast}
that does not wait for replies, and conceptually results in a message being
sent to the joining process.
(In practice, more than one such message is packed into a larger one and
shipped all together, but this is done simply to improve performance.)
At the joining process, a user-specified ``receive routine'' is called once
for each call to {\tt xfer\_out}, and it is given the corresponding
message and locator.\index{state transfer, receive routine}\index{state transfer, mechanism discussed}
\index{state transfer, locator}
It reads the data out of the message using {\tt msg\_get}.
Should the process sending the state fail before the entire state is
transferred, the system picks another member and calls its send routine.
The value of the last locator delivered to the joining process is passed as
a parameter to the send routine, so that the new sender can
continue the transfer from where it broke off.
(If no piece of state has been transferred, the send routine is called
with $-1$ instead of a locator.)
Depending on the application, a new sender could instead decide to
redo the transfer from the beginning or to transfer something entirely
different.
The system simply ships the data from the current sender to the joining
process, which uses the value of the locator to interpret the data.
In our example, a failure in the middle of a transfer invalidates the
current configuration data, so a restarted transfer must transfer the whole
state from the beginning.
When a process joins the group for the first time (that is, when the group
is first created), there is no state to transfer and so no send or receive
routines are called.
Instead an ``initialization routine'' is invoked.
This is specified by the {\tt PG\_INIT} option of {\tt pg\_join}, which
gives the name of the initialization routine.
In our example, the initialization routine sets up the configuration data
structure as appropriate for an empty group.
What about a group that doesn't exist because all its members have
failed?\index{\em logging, activated from {\tt pg\_join}}
Normally, ISIS treats this case by calling the initialization routine.
If desired, however, one can also arrange for the group state to be {\em
logged} on a disk file, by specifying another option ({\tt PG\_LOGGED})
to {\tt pg\_join}.
If this is done, the state will be recovered from the log
after a recovery from a total failure of the group.
Recovery from a log looks like a state transfer,
and is discussed later in the manual.
A final case arises if all
operational members fail in the middle of a state transfer to a new
member.
In this ``in between'' situation, the join of the new member will fail.
The normal course of action is to terminate the new process, and start it
again.
The second time around, it will join an empty group (i.e. create the group
from scratch) and proceed normally.
Here is the modified code.
\begin{verbatim}
#include <stdio.h>
#include "isis.h"
#define QUERY_ENTRY 1
#define MAX_NAMELEN 64
#define NDEPTS 5
#define MAX_MEMBERS 10
#define NASSIGN_LOC 1
#define ADDR_LOC 2
#define NDEPTS_LOC 3
#define DEPTS_LOC 4
int n_assignments;
struct
{
address his_addr;
int his_ndepts;
int his_dept[NDEPTS];
} assignment[MAX_MEMBERS];
address *gaddr_p;
int my_index;
int my_ndepts = 0;
int my_dept[NDEPTS];
FILE *my_file[NDEPTS];
main()
{
/* Same as before */
}
service_maintask()
{
int group_change();
int send(), receive(), init();
/* Join the process group, monitor membership */
/* changes , and set up state transfer */
gaddr_p = pg_join ("timeservice",
PG_MONITOR, group_change, 0,
PG_XFER, 0, send, receive,
PG_INIT, init,
0);
if (addr_isnull (gaddr_p))
exit();
isis_start_done();
}
send (last_locator)
int last_locator;
{
address addr[MAX_MEMBERS];
int ndepts[MAX_MEMBERS];
int dept[MAX_MEMBERS][NDEPTS];
int i, j;
for (i = 0; i < n_assignments; i++)
{
addr[i] = assignment[i].his_addr;
ndepts[i] = assignment[i].his_ndepts;
for (j = 0; j < ndepts[i]; j++)
dept[i][j] = assignment[i].his_dept[j];
}
xfer_out (NASSIGN_LOC, "%d", n_assignments);
xfer_out (ADDR_LOC, "%A", addr, n_assignments);
xfer_out (NDEPTS_LOC, "%D", ndepts, n_assignments);
xfer_out (DEPTS_LOC, "%D", dept, n_assignments * NDEPTS);
}
receive (locator, msg_p)
int locator;
message *msg_p;
{
address addr[MAX_MEMBERS];
int ndepts[MAX_MEMBERS];
int dept[MAX_MEMBERS][NDEPTS];
int i, j, len;
switch (locator)
{
case NASSIGN_LOC:
{
msg_get (msg_p, "%d", &n_assignments);
break;
}
case ADDR_LOC:
{
msg_get (msg_p, "%A", addr, &len);
for (i = 0; i < len; i++)
assignment[i].his_addr = addr[i];
break;
}
case NDEPTS_LOC:
{
msg_get (msg_p, "%D", ndepts, &len);
for (i = 0; i < len; i++)
assignment[i].his_ndepts = ndepts[i];
break;
}
case DEPTS_LOC:
{
msg_get (msg_p, "%D", dept, &len);
for (i = 0; i < n_assignments; i++)
for (j = 0; j < assignment[i].his_ndepts; j++)
assignment[i].his_dept[j] = dept[i][j];
break;
}
}
}
init ()
{
/* Initialize for an empty group */
n_assignments = 0;
}
group_change (gview_p, arg)
groupview *gview_p;
int arg;
{
/* Same as before */
}
receive_query (msg_p)
message *msg_p;
{
/* Same as before */
}
\end{verbatim}
To illustrate one of the ways that the ``locator'' argument can be used, we
broke the state
into four pieces based on type here, and shipped each with a
separate call to {\tt xfer\_out}.
Observe how the send and receive routines match up, and how the locator is
used to distinguish the pieces of state.
In practice, the state would be divided into pieces based on function
rather than on type, with different data structures going into different
pieces, for example.
Also, there was really no need in our example to send the four pieces in
separate calls to {\tt xfer\_out}; this was done merely to show you how each
call to {\tt xfer\_out} matched up with a call to the receive routine.
All the four pieces could have been sent in one call to {\tt xfer\_out}
using the format string {\tt "\%d \%A \%D \%D"}.
In this case, the receive routine would have been called just once, and the
data in the message could have been read out with one or more calls to {\tt
msg\_get}.
You may wish to write the code for this as an exercise.
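For reference, a sketch of that single-call variant is shown below.
It is one possible answer to the exercise, not the manual's official code;
the locator constant {\tt STATE\_LOC} is introduced purely for illustration,
and the declarations are the same as in the send and receive routines above.
\begin{verbatim}
#define STATE_LOC 1
send (last_locator)
    int last_locator;
{
    address addr[MAX_MEMBERS];
    int ndepts[MAX_MEMBERS];
    int dept[MAX_MEMBERS][NDEPTS];
    int i, j;
    for (i = 0; i < n_assignments; i++)
    {
        addr[i] = assignment[i].his_addr;
        ndepts[i] = assignment[i].his_ndepts;
        for (j = 0; j < ndepts[i]; j++)
            dept[i][j] = assignment[i].his_dept[j];
    }
    /* All four pieces go out in a single message */
    xfer_out (STATE_LOC, "%d %A %D %D", n_assignments,
              addr, n_assignments, ndepts, n_assignments,
              dept, n_assignments * NDEPTS);
}
receive (locator, msg_p)
    int locator;
    message *msg_p;
{
    address addr[MAX_MEMBERS];
    int ndepts[MAX_MEMBERS];
    int dept[MAX_MEMBERS][NDEPTS];
    int i, j, len;
    /* Read everything back with one call to msg_get */
    msg_get (msg_p, "%d %A %D %D", &n_assignments,
             addr, &len, ndepts, &len, dept, &len);
    for (i = 0; i < n_assignments; i++)
    {
        assignment[i].his_addr = addr[i];
        assignment[i].his_ndepts = ndepts[i];
        for (j = 0; j < ndepts[i]; j++)
            assignment[i].his_dept[j] = dept[i][j];
    }
}
\end{verbatim}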
The usual reason for calling
{\tt xfer\_out} several times is for the transmission of a multi-record
data structure.
A loop that iterates over the records would be used in this case, with
the locator giving the record number.
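A sketch of such a loop might look like this, assuming for illustration that
the group state is an array {\tt record} of {\tt REC\_COUNT} integers; using
the record number as the locator lets a replacement sender resume where a
failed sender left off.
\begin{verbatim}
#define REC_COUNT 100           /* illustrative record count */
int record[REC_COUNT];          /* the (assumed) group state */
send (last_locator)
    int last_locator;
{
    int r;
    /* last_locator is -1 if nothing has been transferred yet, */
    /* otherwise it is the number of the last record delivered */
    for (r = last_locator + 1; r < REC_COUNT; r++)
        xfer_out (r, "%d", record[r]);
}
receive (locator, msg_p)
    int locator;
    message *msg_p;
{
    /* The locator is the record number */
    msg_get (msg_p, "%d", &record[locator]);
}
\end{verbatim}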
The state transfer operation illustrates another aspect of virtual
synchrony.\index{state transfer, {\em virtual synchrony} and}
Even though state transfer is a complex operation and may involve a number
of messages being sent to and fro, this program was written as if the
state transfer and the join occurred instantaneously and indivisibly.
For example, we did not have to worry about query messages being delivered to
the sending or the receiving
process while the state transfer was still under way, even though
the service may well be receiving queries as the state transfer is going on.
This works because
ISIS guarantees that transfer-state-and-join is a distributed event that
is ordered consistently relative to other events, and this ensures the
correctness of our program.
It is important to observe that the state transfer (or initialization) is
ordered before the join (i.e. the state transfer routines are called before
the monitor routines are invoked), so the correct state to transfer is the
state that existed just {\em before} the new member was added to the group,
as is the case in our example.
The exact sequence of events that occurs during a join is covered in the
next chapter.
\section*{Observations}
This chapter showed you how to develop a distributed
application using ISIS.
Our example essentially partitioned a database over a number of
sites and divided the work of responding to queries among these sites.
It was tolerant of failures and dynamically reconfigurable.
Notice that our query program did not have to be changed when members
joined or left the time-card service; in fact it could continue to run and
issue queries as this was happening.
We even changed the implementation of the service from one that required at
least {\tt NDEPTS} members to one that did not without changing the query
program.
Such an application would normally be extremely complicated to program, yet
we were able to develop our program in a few
relatively simple steps.
We also discussed how the ordering properties of the ISIS system and the
concept of virtual synchrony made it simple to develop a distributed
program by enabling distributed events to be viewed as single units.
Another aspect worth remarking on was the methodology we followed.
We started with a simple version of the time-card service, and
added more features to the service program (e.g. state
transfer), with very little of the existing code being
changed.
This is a consequence of the virtually synchronous approach, and
leads to a simple step-by-step development methodology that is
used very often in ISIS.
One begins with a simple form of the application and then adds more and
more functionality, while still re-using large parts of the existing code.
In most cases, adding functionality simply means adding options to existing
function calls and writing a few new functions.
As you become familiar with the various tools in ISIS and learn to
take advantage of the simplicity arising from the well-ordered environment
that ISIS provides, you will find that writing dynamic and
fault-tolerant distributed programs
can be as simple as writing programs for a single
central machine.
\section{Start sequence}
ISIS supports several
recommended start sequences, depending on whether you are coding
a new application or porting an old one to run under ISIS.
The most standard start sequence is as follows:
\index{start sequence for ISIS clients}\index{\tt isis.h}
\index{\tt isis\_task}
\index{\tt isis\_entry}
\index{\tt isis\_input}
\index{\tt isis\_input\_sig}
\index{\tt isis\_signal}
\index{\tt isis\_init}
\index{\tt isis\_init\_l}
\index{\tt isis\_mainloop}
\index{main task}
\index{\tt isis\_start\_done}
\index{\tt isis\_logentry}
\index{\tt pg\_join}
\begin{verbatim}
#include "isis.h"
main(argc, argv)
char **argv;
{
.... parse arguments ....
/* Connect to ISIS */
isis_init();
/* Declare all tasks, watch routines, etc. (other than entries) */
isis_task(routine, "name");
...
/* Declare all entry points */
isis_entry(number, routine, "name");
...
/* Declare routines that will accept input */
isis_input(fdes, routine, arg);
...
/* Declare signal handlers (or condition variables) */
isis_signal(signo, routine, arg); /* t_fork(routine,arg) when signal occurs */
isis_input_sig(fdes, cond, arg); /* t_sig(cond,arg) if input avail */
...
/* Enter main loop */
isis_mainloop(main_task);
/* ... no return */
}
main_task()
{
/* Join/create any process groups to which I belong */
gid = pg_join("name",
PG_KEYWORD, ....
0);
/* Activate logging for any logged entry points */
isis_logentry(group, number);
...
/* Finished: either call isis_start_done or just return from task */
return;
}
\end{verbatim}
Throughout the entire sequence, delivery of incoming requests
is inhibited.
Only when the main task terminates (by returning)
or calls {\tt isis\_start\_done} will new requests begin
arriving.
During the join, the actual sequence of events that takes place
is as follows.
First, the system checks to see if the group
already exists.
If not, it either calls the user-specified {\em init} routine, if any, or
loads the state from a log.
It then sets up and triggers the group monitor routine, if any.
The initial view will show a single member and list it as having just joined.
\index{process groups, monitor triggered for first time}
\index{process groups, initial view after join}
Otherwise, the group already exists.
In this case, the
join request is authenticated if the group provides an authentication
routine.
Next, a state transfer is done domain by domain in
increasing numeric order of domain.
The state transferred will be
\index{process groups, state transfer}
as of {\em just before} the join took place.
Next, the monitor routine is set up and triggered.
Hence, the new member will see the membership transition corresponding to its
own join
just like all the other members will, and at the same point in the overall
sequence of events that happen to the group.
Notice that the monitor and init routines specified in a {\tt pg\_join} will
be called during the start sequence, before the {\tt pg\_join} returns.
A process can also add itself as a client to another group during the start
sequence.
However, there is no requirement that this be done only during the start
sequence.
Additionally, although it is not recommended, it is possible to do RPC or group RPC
interactions with processes and services elsewhere in the system
prior to the termination of the startup sequence.
As noted above, there are other acceptable startup sequences for ISIS.
The most important of these is for programs in which calling {\tt isis\_mainloop}
poses a problem.
A typical example would arise in the case of porting ``old code''
to run under ISIS.
Here, one has the problem that the existing program is unlikely to be task
structured.
Squeezing an arbitrary block of code into an ISIS task raises questions of
stack size and can therefore be a difficult undertaking; the
{\tt t\_on\_sys\_stack} mechanism is awkward at best.
The alternative we recommend in this case is to arrange for a start
sequence that calls {\tt isis\_init()} directly,
does whatever group joins are desired in line (without ever calling
{\tt isis\_mainloop}!), calls {\tt isis\_start\_done()},
and then enters a large loop running the old code
and periodically calling {\tt isis\_accept\_events(flag)} to give
ISIS a chance to run.
{\em Note that your application will hang if you use this approach
but neglect to call {\tt isis\_start\_done()}.
This common error leads to a state in which your application can
communicate with other programs but can only receive replies from them.
\index{{\tt isis\_start\_done}, failure to call}
\index{hanging, failure to call {\tt isis\_start\_done}}
This is because ISIS is trying to make life as simple as possible
during startup by delaying the arrival of non-startup messages
until initialization is finished.
Use a client dump (see {\tt cl\_dump}) to determine if a process
has failed to call {\tt isis\_start\_done()}.}
A particularly easy kind of old program to convert is one that reads records from a
\index{converting old programs to run under ISIS}
file and processes them one at a time.
In such a program, the ``file'' can be simulated using an internal
buffer that is loaded by ISIS tasks on receipt of messages.
The existing program would loop, reading a record from the buffer and
processing it, then calling {\tt isis\_accept\_events(ISIS\_BLOCK)}
repeatedly until the buffer has more work in it.
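A minimal sketch of this buffering scheme is shown below.
The entry routine {\tt got\_record}, the circular buffer, the integer record
format, and the {\tt process\_record} routine standing in for the old code
are all assumptions made here for illustration; {\tt got\_record} would be
declared with {\tt isis\_entry} during the start sequence.
\begin{verbatim}
#define BUFSIZE 128
int buffer[BUFSIZE];          /* simulated "file" of records */
int head = 0, tail = 0;       /* circular buffer indices */
/* ISIS task: runs on message receipt, loads one record */
/* (no overflow check, for brevity) */
got_record (msg_p)
    message *msg_p;
{
    msg_get (msg_p, "%d", &buffer[tail]);
    tail = (tail + 1) % BUFSIZE;
}
/* The old code: loop reading a record and processing it */
old_main_loop()
{
    for (;;)
    {
        /* Let ISIS run until the buffer has work in it */
        while (head == tail)
            isis_accept_events (ISIS_BLOCK);
        process_record (buffer[head]);   /* existing routine */
        head = (head + 1) % BUFSIZE;
    }
}
\end{verbatim}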
The {\tt isis\_accept\_events} routine is also called {\em automatically} if your ISIS application
blocks for some other reason, say if the main task calls {\tt t\_wait}.
Notice that in this approach, the ISIS limit on stack size still applies
to any tasks ISIS starts up (say, on message receipt), but
does {\em not} apply to the old code.
Thus, say that you had a program that looked like this:
\begin{verbatim}
main(argc, argv)
char **argv;
{
.... parse arguments ....
for(;;)
{
printf(">> ");
fflush(stdout);
readline();
think();
}
}
\end{verbatim}
Were you to restructure this as an ISIS application, you would have to
be careful about two aspects.
First, the program as shown above probably does blocking IO when it
calls {\tt readline}.
Second, the original programmer is unlikely to have been particularly
careful about stack usage in either {\tt readline} or {\tt think()},
You could modify this into an ISIS program as follows.
First, you will want to change the algorithm so that
{\tt readline} will run only when input is available on
{\tt stdin}.
Secondly, you will want to be careful that {\tt think}
runs on the main stack.
A simple approach is to structure the main loop of the program to
use the {\tt isis\_select} interface (similar to the UNIX
select(2) system call), or to
make use of a condition variable which ISIS
will signal when input becomes available
on {\tt fileno(stdin)}. While waiting on this condition variable,
the main task will block and this will substitute for a periodic call to {\tt isis\_accept\_events}.
In the code below, the {\tt isis\_select} interface is used.
Note that on UNIX systems supporting the {\tt fd\_set} macro, the
use of that macro is preferable to the sort of explicit shift
operator illustrated here (which only works for file descriptors in
the range 0..31).
\begin{verbatim}
#include "isis.h"
main(argc, argv)
char **argv;
{
.... parse arguments ....
/* Connect to ISIS */
isis_init();
/* Declare all tasks, watch routines, etc. (other than entries) */
isis_task(routine, "name");
...
/* Declare all entry points */
isis_entry(number, routine, "name");
...
/* Declare routines that will accept input */
isis_input(fdes, routine, arg);
...
/* Declare signal handlers (or condition variables) */
isis_signal(signo, routine, arg); /* t_fork(routine,arg) when signal occurs */
isis_input_sig(fdes, cond, arg); /* t_sig(cond,arg) if input avail */
...
/* Join/create any process groups to which I belong */
gid = pg_join("name",
PG_KEYWORD, ....
0);
...
/* Finished: call isis_start_done */
isis_start_done();
/* modified main loop */
forever
{
int input_mask; /* Use fd_set if available on your system! */
printf(">> ");
fflush(stdout);
/* Tell ISIS to delay this task until data can be read */
input_mask = 1 << fileno(stdin);