.sp 0.5i
.ce 2
Keeping watch over the flocks
at night (and day)
.sp 0.3i
.ce 8
Kenneth Ingham
University of New Mexico Computing Center
Distributed Systems Group
2701 Campus NE
Albuquerque, NM 87131
(505) 277-8044
ingham@charon.unm.edu
ucbvax!unmvax!charon!ingham
.sp 0.2i
.ce
Topic Areas: Applications, System management, Utilities
.sp 0.5i
The computing facilities offered by the University of New Mexico
Computing Center include three microvaxen, five large vaxen (780 or
bigger), and a Sequent B8000.  In addition to these Unix/VMS machines,
the UNMCC Distributed Systems Group (DSG) monitors a number of the
microvaxen and Sun workstations scattered across campus.  This
duty falls to the DSG programmer designated as "DOC", or "DSG On Call",
who carries the beeper according to a monthly rotation schedule.
.sp
In the past, shell scripts running every six hours reported various
system statistics to DOC, who then scanned the output for signs of
possible trouble.  As the number of machines and the number of
potential problems grew, the mound of output that DOC had to process,
most of which merely indicated normal system operation, became
overwhelming.  Now, with several machines to monitor and only one
person acting in this capacity, DOC can often waste a tremendous amount
of time wading through system status reports, time that would be better
spent actually fixing system problems.
.sp
In response to this situation, the author developed a tool which
introduces some intelligence into the machine's self-reporting, letting
the machine filter out messages indicating normal operation and
forward to DOC only those messages which point out trouble areas.
The result of these efforts is Watcher, a very general and extensible
system self-monitor.  Running more often than the set of
shell scripts, Watcher keeps closer tabs on the system; since it
delivers only a summary of potential problems, however, this extra
monitoring produces \fIno\fR corresponding increase in the demand on
the system manager.  Problems are less likely to slip by unnoticed in
the more concise output, which improves overall system availability and
makes more effective use of the system manager's time.
.sp
Watcher was designed to be almost as flexible as DOC in deciding what
constitutes a problem with the system.  Running at intervals specified
in crontab, Watcher issues a number of
user-specified commands (each of which
delivers its output in a different format), parsing all or part of the
output from either the left or the right.  It compares this output
with that obtained on the previous run, checking for indications
of a system abnormality.  Such signs might take the form of an
overly abrupt change in a certain value (e.g., a process which suddenly
begins gobbling vast amounts of CPU time),
a value which falls outside the allowable maximum or minimum (such as
an overly full file system),
or an unacceptable change in a string value
(e.g., when "up" changes to "down").  For commands such as
"ps" whose output varies considerably with each run, specific 
parts of the output can be designated as a key; successive runs of
Watcher will home in on these key areas for their comparisons.
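.sp
The comparisons described above can be illustrated with a minimal
sketch in Python.  This is not Watcher's own code: the function names,
the whitespace-separated field parsing, and the limits are all
assumptions made for illustration.  A negative field index counts from
the right, mirroring the left-or-right parsing mentioned above.
.nf
```python
# Hypothetical sketch; names and limits are illustrative, not Watcher's own.

def check_value(name, old, new, max_value=None, min_value=None,
                max_pct_change=None):
    """Return a list of problem descriptions for one numeric field."""
    problems = []
    if max_value is not None and new > max_value:
        problems.append(f"{name}: {new} exceeds maximum {max_value}")
    if min_value is not None and new < min_value:
        problems.append(f"{name}: {new} is below minimum {min_value}")
    # A percentage change needs a previous, nonzero value to compare with.
    if max_pct_change is not None and old is not None and old != 0:
        pct = abs(new - old) / abs(old) * 100.0
        if pct > max_pct_change:
            problems.append(f"{name}: changed {pct:.0f}% since last run")
    return problems


def check_string(name, old, new):
    """Report any change in a string field (e.g. "up" becoming "down")."""
    if old is not None and new != old:
        return [f"{name}: changed from {old!r} to {new!r}"]
    return []


def parse_keyed(output, key_field, value_field):
    """Map the key column to the numeric value column of each line.

    Lines that are too short, or whose value is not a number (such as
    a header line), are skipped.
    """
    table = {}
    for line in output.splitlines():
        fields = line.split()
        try:
            table[fields[key_field]] = float(fields[value_field])
        except (IndexError, ValueError):
            continue
    return table


def compare_runs(previous, current, **limits):
    """Check every key of the current run against the previous run."""
    problems = []
    for key, new in sorted(current.items()):
        problems += check_value(key, previous.get(key), new, **limits)
    return problems
```
.fi
.sp
In this sketch, calling parse_keyed on the saved output of two
successive runs and passing the results to compare_runs with, say,
max_pct_change=100 would report only those keys whose value more than
doubled or vanished to less than nothing of its former size; a key
that is new in the current run has no previous value and is checked
only against the absolute limits.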
.sp
Since the user specifies not only the commands Watcher will execute and
the time lapse between successive runs, but also the aforementioned
parameters which indicate system anomalies, Watcher serves
as a very flexible, general system monitor.  Its use at UNM has
markedly increased the productivity of the system manager, which has
led in turn to an increase in the reliability and availability of the
systems at UNMCC.
