.TI Keeping watch over the flocks
.TI by night (and day)
.AU 7
Kenneth Ingham
University of New Mexico Computing Center
Distributed Systems Group
2701 Campus NE
Albuquerque, NM 87131
(505) 277-8044
ingham@charon.unm.edu  or ucbvax!unmvax!charon!ingham
.AB
Over the last several years, the number of machines maintained by the
University of New Mexico Computing Center has increased rapidly, yet
the number of system managers monitoring these systems has remained
static.  Consequently, the system managers were faced with the task of
watching more and more machines; since only one system manager (known
affectionately as "DOC") is on call at any time, this soon proved to
be an unacceptable situation.  Shell scripts running every six hours
gave some assistance; this was offset by the fact that the scripts
generated a great deal of output indicating normal system operation,
which the system manager still had to scan carefully for signs of
trouble.  This paper describes \fIwatcher\fR, a flexible system monitor
which watches the system more closely than the human system manager
while generating less output for him to examine.
.sp
Running more often than the above-mentioned set of shell scripts,
\fIwatcher\fR is able to keep closer tabs on the system; since it
delivers only a list of potential problems, however, this extra
monitoring produces \fIno\fR corresponding increase in the demand on
DOC.  No problems slip by unnoticed in the more concise output,
leading to an improvement in overall system availability as well as the
more effective utilization of the system manager's time.
.BD
.SE 0. Acknowledgments (I couldn't have done it without you)
I would like to thank Leslie Gorsline for her assistance in the writing
of this paper.  Without her, this paper might not have been.  Thanks
also to the UNMCC Distributed Systems Group for their comments, which helped
improve \fIwatcher\fR.
.SE 1. Background (the problem)
The computing facilities offered by the University of New Mexico
Computing Center (UNMCC) include three microvaxen, five large vaxen
(780 or bigger), and a Sequent B8000.  In addition to these Unix/VMS
machines, the UNMCC Distributed Systems Group (DSG) monitors a number
of the various microvaxen and Sun workstations scattered across
campus.  This duty falls to the DSG Programmer designated as "DOC", or
"DSG On Call", who receives his beeper based on a monthly rotation
schedule.
.sp
In the past, shell scripts running every six hours reported various
system statistics to DOC, who then scanned the output for signs of
possible trouble.  The output of these shell scripts became
overwhelming as the number of machines and potential problems grew;
corresponding to this increase in output was an increase in the amount
of time that DOC had to spend reading this output.  In addition, most
of this output merely indicated normal system operation; potential
problems were buried amongst non-problems.  Because of this, DOC could
often waste a tremendous amount of time wading through system status
reports, time which could be better spent actually fixing system
problems.
.sp
Unix is equipped with many powerful tools for program development, but
none which simply watch the system for signs of trouble.  Programs like
\fIps\fR and \fIdf\fR provide information regarding the current state
of the machine, yet it still remains DOC's responsibility to interpret
this information and assess the health of the system at any given
time.  This deficiency can be rectified by providing the
system with the capacity to determine its own state of health, advising
DOC when it notices a problem which requires DOC's intervention.
.SE 2. Design Goals (devising the solution)
In designing \fIwatcher\fR, the author closely examined just what DOC
does in monitoring the system; just how \fIdoes\fR DOC spot potential
trouble in the DOC reports?  These reports consist of output from \fIdf
-i\fR, \fIruptime\fR, \fIps -aux | sort\fR, and the tail of
\fIcronlog\fR, which usually only changes in the middle of the night.
It was determined that DOC's task consisted primarily of scanning
various numbers in this output, deciding whether or not they had
exceeded an allowable maximum or minimum, or if the values had changed
too much from the last time the command was run, assuming the last
value is even remembered.  Getting a computer to do this is more
complicated than might seem at first glance, due to inconsistencies in
the location of pertinent information between runs of these commands.
For instance, the process occupying the fifth line of \fIps -ax\fR
might next time appear on the eighth line; similarly, \fIuptime\fR does
not consistently put germane information in the same place on the line.
.sp
While flexibility is certainly a primary design consideration, it is
not the whole story.  In order to improve DOC's effectiveness, the
program should run frequently, roughly every two or three hours,
catching problems early (hopefully before they have affected
the users).  Thus, the program should also be as silent as possible
except when it detects a potential problem; any advantage DOC gains in
using \fIwatcher\fR would be eliminated if the program delivered an
exceedingly verbose status report every two hours.  \fIwatcher\fR's
problem reports should be exact and concise, leading DOC immediately to
the trouble.
.sp
The problem of reducing the amount of output DOC must process can be
approached in different ways, including the redesign of the current
shell scripts.  A simple \fIawk\fR script can watch the output from
\fIdf\fR [1].  However, each command would require a custom-tailored
\fIawk\fR script to look at it.  This task grows more complicated as the
number of programs running increases.
While a program could be written to
generate these \fIawk\fR scripts, this process is needlessly complex;
for only a bit more work, an efficient C program such as
\fIwatcher\fR can be developed.  
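.sp
For comparison, the single-purpose approach can be sketched as follows.
The sketch is written in Python rather than \fIawk\fR and is not the
script from [1]; the function name and the default limit of 89 are
assumptions made for illustration.  Note that it knows which word of a
\fIdf -i\fR line holds the capacity, so it is of no use for any other
command.
.EX
```python
def df_over_limit(df_output, max_percent=89):
    """Return (filesystem, percent) pairs whose capacity exceeds max_percent."""
    problems = []
    for line in df_output.splitlines()[1:]:      # skip the header line
        words = line.split()
        if len(words) < 5:
            continue                             # not a filesystem line
        percent = int(words[4].rstrip("%"))      # capacity is the 5th word
        if percent > max_percent:
            problems.append((words[0], percent))
    return problems

SAMPLE = """Filesystem    kbytes    used   avail  capacity   iused   ifree  %iused  Mounted on
/dev/hp1f      52431   39763    7424    84%    6937    9447    42%   /develop
/dev/hp0h     140488  111195   15244    91%   10145   28767    26%   /usr
"""

print(df_over_limit(SAMPLE))
```
.NX
Watching \fIruptime\fR or \fIps\fR in the same way would mean writing
another such function for each, which is the duplication a general
program avoids.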
.SE 3. Design (actual implementation of the solution)
Run at intervals specified in \fIcrontab\fR, \fIwatcher\fR parses a
control file (./\fIwatcherfile\fR by default)
with a \fIyacc\fR-generated parser, building a data structure
containing all of the information from the file.  The file contains the
list of commands \fIwatcher\fR
should run (the pipeline), output specifications
for each command (the output format), and the guidelines used in
determining if something is amiss and should be reported to DOC (the
change format).  A sample \fIwatcher\fR control file would look
something like this (comment lines begin with a '#'):
.EX
# Here is the pipeline and its alias:
(df -i | /usr/ucb/tail +2) { df }
# the output format; this is a column output format:
	1-9 device%k 41-42 spaceused%d 64-65 inodesused%d:
# and the change format:
		spaceused 15%;
		spaceused 0 89;
		inodesused 15%;
		inodesused 0 49.

# another command example:
(/usr/ucb/ruptime | fgrep -f UnmHosts) { ruptime }
# this is a relative output format
	2 status%s 1 machine%k 7 loadav%d:
# and another change format:
		loadav 0 10;
		status "up".
.NX
The first entry causes \fIwatcher\fR to run the \fIdf\fR pipeline
listed in parentheses.  When reporting problems, \fIwatcher\fR refers
to this command by the alias provided in the braces; if no alias
appears, \fIwatcher\fR uses the entire pipeline.  
.sp
The output format
instructs \fIwatcher\fR how to parse the output;
column format, indicated in the output format by \fBnum-num\fR,
instructs \fIwatcher\fR that the output should be parsed
by columns, while relative format, denoted by a single integer, shows
that the output should be broken up by whitespace.
Through the convention \fBname%type\fR, the output format also names each
field, indicating whether the field is numeric, string, or
keyword, specified by \fBd\fR, \fBs\fR, or \fBk\fR respectively.
Keyword fields are
used to match up corresponding output lines between runs.  Thus
.EX
	41-42 spaceused%d
.NX
indicates that this field, named \fBspaceused\fR, contains numeric 
information in columns 41-42, while
.EX
	2 status%s
.NX
informs \fIwatcher\fR that the second word (group of non-whitespace
characters) on the line is a string field named \fBstatus\fR.
For the \fIdf\fR example given above,
.EX
Filesystem    kbytes    used   avail  capacity   iused   ifree  %iused  Mounted on
/dev/hp1f      52431   39763    7424    84%    6937    9447    42%   /develop
.NX
\fBdevice\fR would be \fI/dev/hp1f\fR, \fBspaceused\fR would be 84,
and \fBinodesused\fR would be 42.  Similarly, the output from the
\fIruptime\fR example, which looks like this
.EX
charon        up 26+07:53,    17 users,  load 3.12, 2.90, 2.66
.NX
would be broken at the following places:
.EX
charon | up | 26+07:53, | 17 | users, | load | 3.12, | 2.90, | 2.66
.NX
assigning "up" to \fBstatus\fR, and 3.12 to \fBloadav\fR.
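.sp
The two ways of locating a field can be sketched in a few lines.  This
is an illustration in Python, not \fIwatcher\fR's own C code; the
function name, the tuple representation of an output format, and the
stripping of a trailing "%" or "," from numeric fields are all
assumptions.
.EX
```python
def extract_fields(line, output_format):
    """Pull named fields out of one line of command output.

    output_format is a list of (position, name, type) tuples.  A position
    is either (first_col, last_col), 1-based and inclusive (column format),
    or a single 1-based word number (relative format).
    """
    words = line.split()
    fields = {}
    for position, name, ftype in output_format:
        if isinstance(position, tuple):
            first, last = position
            text = line[first - 1:last].strip()
        else:
            text = words[position - 1]
        if ftype == "d":
            # Assumption: trailing punctuation on a number is ignored.
            fields[name] = float(text.rstrip("%,"))
        else:
            fields[name] = text                  # string or keyword field
    return fields

DF_FORMAT = [((1, 9), "device", "k"),
             ((41, 42), "spaceused", "d"),
             ((64, 65), "inodesused", "d")]
RUPTIME_FORMAT = [(2, "status", "s"), (1, "machine", "k"), (7, "loadav", "d")]

df_line = "/dev/hp1f      52431   39763    7424    84%    6937    9447    42%   /develop"
ruptime_line = "charon        up 26+07:53,    17 users,  load 3.12, 2.90, 2.66"

print(extract_fields(df_line, DF_FORMAT))
print(extract_fields(ruptime_line, RUPTIME_FORMAT))
```
.NX
Column format suits output such as \fIdf\fR's, where the numbers line up
under a header; relative format suits output such as \fIruptime\fR's,
where the width of each word varies from line to line.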
.sp
The name field also appears in the change format, designating allowable
values for this field to have.  These values can be specified as a
single character string in the case of string fields; in the case of
numeric fields, the values take the form of either
percentage or absolute changes, or a minimum and maximum which delineate
an acceptable range.
Thus
.EX
	inodesused 15%;
	inodesused 0 49.
.NX
signifies that DOC should be notified if the field named \fBinodesused\fR
increases by more than 15% from the last run, or if it is outside the
range 0 to 49; similarly
.EX
	status "up";
.NX
informs \fIwatcher\fR to notify DOC if the \fBstatus\fR field contains
anything other than the word "up".
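.sp
The three kinds of test can be sketched as below.  This is an
illustration in Python, not \fIwatcher\fR's implementation; the function
name, the tuple representation of a rule, and the wording of the
messages are assumptions.  The sketch also assumes that a percentage is
measured relative to the previous value and that a change in either
direction counts.
.EX
```python
def check_field(name, current, previous, rules):
    """Return a list of complaints about one field (empty if all is well).

    rules is a list of tuples:
      ("percent", limit)    changed by more than limit percent since last run
      ("range", low, high)  value lies outside low..high
      ("string", expected)  value differs from expected
    previous is None when no earlier value is known.
    """
    complaints = []
    for rule in rules:
        kind = rule[0]
        if kind == "percent":
            # Skip when there is no usable earlier value to compare against.
            if previous is None or previous == 0:
                continue
            change = abs(current - previous) / abs(previous) * 100.0
            if change > rule[1]:
                complaints.append(
                    "%s changed by more than %g%%; previous value %.2f; current value %.2f"
                    % (name, rule[1], previous, current))
        elif kind == "range":
            low, high = rule[1], rule[2]
            if current < low or current > high:
                complaints.append("%s = %.2f; valid range %.2f to %.2f"
                                  % (name, current, low, high))
        elif kind == "string":
            if current != rule[1]:
                complaints.append('%s is "%s"; expected "%s"'
                                  % (name, current, rule[1]))
    return complaints

INODE_RULES = [("percent", 15), ("range", 0, 49)]
SPACE_RULES = [("percent", 15), ("range", 0, 89)]

print(check_field("inodesused", 26.0, 20.0, INODE_RULES))
print(check_field("spaceused", 91.0, 84.0, SPACE_RULES))
print(check_field("status", "down", None, [("string", "up")]))
```
.NX
A range test needs only the current value, while a percentage test
needs the value from the previous run as well.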
.sp
As \fIwatcher\fR parses the output of a pipeline, it stores the
pertinent parts of the output in a history file (by
default, ./\fIwatcher.history\fR).
The next time \fIwatcher\fR runs, it reads this file to provide
comparison values for the command.  If a command is new (i.e. it has no
previously-stored output in the history file), \fIwatcher\fR checks the
fields which require no previous data, such as min-max fields, while
still storing \fIall\fR of the relevant information to the history file.
Thus, the next time the new command is run, it will be an \fIold\fR command,
and meaningful between-run comparisons can be made.
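.sp
The bookkeeping between runs can be sketched as follows.  The format of
the real history file is not described here, so this Python sketch
simply assumes a JSON file indexed by command alias and then by the
value of the keyword field; the function names are likewise invented
for illustration.
.EX
```python
import json
import os
import tempfile

def load_history(path):
    """Read the history file, or return an empty history if there is none."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def save_history(path, history):
    with open(path, "w") as f:
        json.dump(history, f)

def previous_and_update(history, alias, key, fields):
    """Return the fields stored last run for this line (or None), then
    record the current fields so that the next run can compare against them."""
    lines = history.setdefault(alias, {})
    old = lines.get(key)
    lines[key] = fields
    return old

path = os.path.join(tempfile.mkdtemp(), "watcher.history")

# First run: the command is new, so there is nothing to compare against.
history = load_history(path)
old1 = previous_and_update(history, "df", "/dev/hp0h",
                           {"spaceused": 85.0, "inodesused": 20.0})
save_history(path, history)

# Second run: the values stored by the first run are now available.
history = load_history(path)
old2 = previous_and_update(history, "df", "/dev/hp0h",
                           {"spaceused": 91.0, "inodesused": 26.0})
save_history(path, history)

print(old1)
print(old2)
```
.NX
Indexing by the keyword field rather than by line number is what lets a
line be matched with its predecessor even when it moves within the
output.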
.sp
When \fIwatcher\fR
detects no problems with the system, DOC receives an empty mail message
with the subject "\fIhostname\fR had no problems at \fIdate\fR";
this is to ensure that \fImail\fR is running correctly.
When it notices a problem which should be brought to DOC's attention,
it mails the system problem report in a concise
format, explaining what is wrong and why.  
Thus, rather than the megabytes of shell script output that DOC
used to receive and have to read,
he merely sees this when he reads his mail:
.EX
Mail version 5.2 6/21/85.  Type ? for help.
"/usr/spool/mail/ingham": 5 messages 5 new
 N  1 root@charon.unm Sat Apr 11 16:00  8/212  "charon had no problems at Sat"
 N  2 root@ariel.unm Sat Apr 11 16:00  8/208  "ariel had no problems at Sat "
 N  3 root@geinah.unm Sat Apr 11 16:00  11/417 "System problem report for gei"
 N  4 root@izar.unm Sat Apr 11 16:00  8/204  "izar had no problems at Sat A"
 N  5 root@deimos.unm Sat Apr 11 16:00  8/212  "deimos had no problems at Sat"
.NX
The letters indicating no problems can be immediately deleted, and DOC
can turn his attention to the letter indicating a
system problem.  A sample problem report
would look something like this:
.EX
df has a max/min value out of range:
/dev/hp0h     140488  111195   15244    91%   10145   28767    26%   /usr
where spaceused = 91.00; valid range 0.00 to 89.00.
Also it had inodesused change by more than 10%.
Previous value 20.00; current value 26.00.
.NX
Note that if a line has more than one indication of a problem, all
anomalies are included in the report.
This provides DOC with as much information as possible, allowing him
to determine the problem quickly and devise 
a rapid fix (hopefully before users know something is amiss).  
.sp
.SE 4. Results (how it's helped us)
\fIwatcher\fR's primary advantage lies in the reduction of DOC's work
load.  It has taken over the more menial aspects of monitoring a system,
tasks like reading and comparing numbers, 
giving DOC more time to concentrate on bugs of a nature which
\fIwatcher\fR isn't set up to monitor, such as problems in the
accounting system.
DOC is apprised of potential problems quickly, and in
some cases can repair them in less time than simply
reading the shell script output
would have taken.
.sp
The ability to monitor changes between runs has also helped bring to our
attention some
problems which were missed in the DOC reports.  For example,
disk space used on \fI/u2\fR on one of our machines jumped by more than 15%.  Since
this jump did not force the total space used above 90%, at which point
DOC would have investigated the filesystem, it is unlikely
that DOC would have even noticed this sudden change.  The facility to
watch for relative changes between runs enables DOC to catch problems in
their infancy, and fix problems such as filesystems filling up too
rapidly before they inconvenience the users.
.sp
Since the system manager specifies not only the commands \fIwatcher\fR will
execute and the time lapse between successive runs, but also the
parameters which indicate system anomalies,
\fIwatcher\fR can easily be seen as a very flexible, general system
monitor.  Its use at UNM has provided an increase in the
productivity of the system manager, which has led in turn to an
increase in the reliability and availability of the systems at UNMCC.
.SE 5. Availability (how to get one)
\fIwatcher\fR will be sent to the moderator of mod.sources after the
conference is over.
.SE 6. References (you might also find this interesting)
.in +0.5i
.ti -0.5i
[1] Monitoring Free Disk Space, Rik Farrow, Wizard's Grabbag, \fIUnix
World\fR, Vol. IV, no. 3, pp. 86-87.
.in -0.5i
