From pa.dec.com!decwrl!uunet!sparky!kent Mon Aug  5 13:21:57 PDT 1991
Article: 2547 of comp.sources.misc
Newsgroups: comp.sources.misc
Path: pa.dec.com!decwrl!uunet!sparky!kent
From: Robert Steven Glickstein <bobg+@andrew.cmu.edu>
Subject:  v21i045:  msort - A flexible sort in perl, Part01/01
Message-ID: <1991Jul26.214758.3369@sparky.IMD.Sterling.COM>
X-Md4-Signature: d3261e47574247bfdc406cc86ac422d8
Sender: kent@sparky.IMD.Sterling.COM (Kent Landfield)
Organization: Sterling Software, IMD
Date: Fri, 26 Jul 1991 21:47:58 GMT
Approved: kent@sparky.imd.sterling.com
Lines: 1196

Submitted-by: Robert Steven Glickstein <bobg+@andrew.cmu.edu>
Posting-number: Volume 21, Issue 45
Archive-name: msort/part01
Environment: Perl

This is msort, a Perl script for sorting lines in much the same
fashion as the standard Unix sort(1) command.  Msort has the following
additional features:

	- Five selectable sorts -- ascii, bibliographic, numeric,
		by date, and by name (the by-name sort works on
		names in "firstname lastname" order, not
		"lastname, firstname" order)
	- Ability to define new types of sorts (as Perl libraries)
	- Apply different types of sorts to different fields
	- Ability to split very large inputs among temporary files,
		then merge
	- Any regular expression can be the field-separator
	- Any string can be the record separator

Msort is not intended to replace sort(1), as it is too slow.  However,
when you absolutely, positively need something sorted by a
non-standard comparison predicate, or when the input size chokes
sort(1), then msort will see you through.

Be sure to change the DESTDIR line in the Makefile and the #! line in
the perl script to reflect the paths on your system.

This is my first major stab at writing in Perl.  All sorts of feedback
welcome.

----------cut here----------
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  Makefile msort.1 msort.pl
# Wrapped by bobg@ephrata.andrew.cmu.edu on Fri Jul 26 16:19:06 1991
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'Makefile' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'Makefile'\"
else
echo shar: Extracting \"'Makefile'\" \(304 characters\)
sed "s/^X//" >'Makefile' <<'END_OF_FILE'
DESTDIR = /afs/.andrew.cmu.edu/itc/destdir/campus/contributed/depot/msort
X
all:
X
install: all
X	-mkdir $(DESTDIR) $(DESTDIR)/bin $(DESTDIR)/man $(DESTDIR)/man/man1
X	install -c -m 0755 msort.pl $(DESTDIR)/bin
X	mv $(DESTDIR)/bin/msort.pl $(DESTDIR)/bin/msort
X	install -c -m 0644 msort.1 $(DESTDIR)/man/man1
END_OF_FILE
if test 304 -ne `wc -c <'Makefile'`; then
    echo shar: \"'Makefile'\" unpacked with wrong size!
fi
# end of 'Makefile'
fi
if test -f 'msort.1' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'msort.1'\"
else
echo shar: Extracting \"'msort.1'\" \(13723 characters\)
sed "s/^X//" >'msort.1' <<'END_OF_FILE'
X.TH MSORT 1 "July 23, 1991"
X.UC
X.SH NAME
msort \- a better sort of sort
X.SH SYNOPSIS
X.B msort
X[
X.I options ...
X] [\-\-] [
X.I file
X] ...
X.br
X.SH DESCRIPTION
X.I Msort
sorts its input according to the given options, and writes the result
on the standard output.  It is intended to be a more flexible version
of the standard Unix
X.IR sort(1)
although it is not a replacement; it is written as a
X.IR perl(1)
script and consequently is considerably slower than
X.IR sort(1).
X.I Msort
should be used only for difficult or large sorts which
X.IR sort(1)
cannot handle.
X.PP
The type of sort performed by
X.I msort
is determined by which
X.I comparison operators
are chosen by the user.  A comparison operator compares two strings
and decides which is "smaller" and which is "larger" according to
some criterion.
X.I Msort
includes five selectable comparison operators and permits the user to
add new comparison operators.  The five builtin operators are:
X.TP
X.B ascii
This is the default comparison operator, and orders strings according
to which comes first in ASCII collating sequence.
X.TP
X.B bib
This is a bibliographic comparison operator, which ignores alphabetic
case, non-alphanumeric characters, and the first word of the string if
it is "the," "an," or "a."
X.TP
X.B num
This treats its subject strings as numbers, and places lower numbers
before higher ones.
X.TP
X.B name
This comparison operator attempts to recognize its subject strings as
people's names and sort them by last name, case-insensitively (it also
ignores hyphens and apostrophes in names like
X"O'Shaughnessy\-Milkbone").  It attempts to do the right thing in
the presence of various affixes like "Mrs." or "Jr.", so that a name
like "Dr. Warren X. Yablonski, III" will, for the purposes of the
comparison, be transformed into "yablonski warren x iii".  It should
even correctly handle names that have gone affix-crazy, such as
X "The Rt. Hon. Rev. Dr. Gen. Sir Percy Schmercy, Sr., Esq."
X.TP
X.B date
This comparison operator treats its subject strings as dates, and
places earlier dates before later ones.  It attempts to understand
many different styles of date-writing, and can sometimes even distinguish
between the American "month/day/year" and the European "day/month/year".
X.PP
You may define your own comparison operators by writing a small
X.IR perl(1)
file (see below).
X.PP
Once a comparison operator is chosen, it must be applied to a portion
of the input lines.  If each line of the input has the form "Name,
Date, Comments" and you want to sort the input by date, then you
would choose the "date" comparison operator and apply it to the second
field (the first and third fields do not contain date information).
By default, fields are separated from each other by whitespace;
however, you can designate any regular expression as being a
field-separating pattern (see below).  If you do not specify any
fields as input to the comparison operator, then the entire line is
used (actually, the
X.I concatenation
of all the fields in the line is used; see the discussion of the \-mf
option, below).  It is possible to specify a sequence of fields to be
used as input to a single comparison.
X.PP
You may specify a sequence of fields or comparison operators to use in
sorting the input.  Subsequent operators or fields designated become
minor sorts, in decreasing order of significance.  (If an earlier
comparison determines two lines are equal, later comparisons will be
tried in turn.)  Thus, if my input lines look like "Name, Age,
Occupation," I may want a "name" sort on the first field, and when two
lines have the same name, I want the younger one listed first, so I
would specify a secondary "num" sort on the second field.
X.PP
Any time you specify a sort, you may indicate whether to reverse the
ordering and whether alphabetic case is significant.
X.PP
On very large sets of data,
X.I msort
will split its input into "chunks," sort them individually, store the
intermediate results in temporary files, and merge the temporary files
together to produce the sorted output.  Thus it is difficult to
exceed
X.I msort's
sorting capacity.  It is easy, however, to exceed your patience
waiting for
X.I msort
to do its job.  My best advice is that you watch
X.I msort's
progress using one or more \-v flags (see below), and if it's too slow
for you, drink some coffee, read some bboards, or go shopping.
X.SH OPTIONS
X.TP
X.B \-v
Verbose.  This tells
X.I msort
to perform its work verbosely.  Status messages will be printed to the
diagnostic output (stderr).  Additional v's produce greater verbosity,
although a single one should give you more than enough running
commentary.
X.TP
X.B \-a
Analyze.
X.I Msort
will attempt to prescan its input to determine the amount of data it
will have to deal with.  It also tries to determine the maximum number
of open files per process on your system.  Once it has done this, it
tries to calculate an optimum "chunk" size, favoring several small
chunks over few large chunks.
X.I Msort
cannot prescan the standard input.
X.TP
X.B \-mc \fInum\fR
Maximum number of chunks permitted.  The input will be split into no
more than
X.I num
chunks.  This option only works if you also specify the \-a flag.  By
default, the maximum number of chunks is slightly less than the
system's per-process file descriptor limit.  Overrides the
MSORTCHUNKS environment variable.
X.TP
X.B \-I \fIdir\fR
Place directory
X.I dir
on the search path, which is the list of directories in which to
search for Perl libraries to be
included with the \-l flag (see below).  \-L is a synonym for \-I.
X.TP
X.B \-l \fIlib\fR
Include Perl file
X.I lib
which can be found in one of the directories on the search path (see
X\-I).  This is for including Perl code that implements additonal
comparison operators.
X.TP
X.B \-s \fIsortname\fR
Select a type of sort, or comparison operator.  The five built-in
values for
X.I sortname
are "ascii," "bib," "num," "name," and "date."
X.I Sortname
may also be the name of a comparison operator included in a library
with the \-l flag.  The \-s flag also resets the behavior of the \-r
and the \-c flags (see below).
X.TP
X.B \-r
Reverse the sense of the current sort.  Reverses the sense of
comparisons for all fields specified up until, but not including, the
next use of the \-s flag (after which, if reversed ordering is again
sought, \-r must be used again).
X.TP
X.B -c, +c
Turn case-sensitivity off or on, respectively.  Like \-r, this flag
affects all comparisons up until, but not including, the next use of
the \-s flag, at which time the default behavior (case is significant)
is restored.  Note that, as far as built-in comparison operators go,
X\-c only has an effect in the "ascii" operator; all the others are
always case-insensitive.  User-defined comparison operators may make
use of the $CaseMatters
X.IR perl(1)
variable to determine whether \-c was used.
X.TP
X.B [+\-]\fIfieldnum\fR[,\fInumfields\fR]
Sort on field
X.I fieldnum,
continuing for
X.I numfields
fields, according to the most recently selected comparison operator
X("ascii" by default).
If a \-r or a \-c or +c has appeared since the
most recent \-s, then this sort will be performed in reverse,
case-insensitively, or case-sensitively, respectively.
X.br
If the form +\fIfieldnum\fR[,\fInumfields\fR] is used, then
X.I fieldnum
refers to the \fIfieldnum\fRth field from the left (where the leftmost
field is +0).
If the form \-\fIfieldnum\fR[,\fInumfields\fR] is used, then
X.I fieldnum
refers to the \fIfieldnum\fRth field from the right (where the
rightmost field is \-0).
X.br
By default,
X.I numfields
is 1.  If it is 0, then all fields from
X.I fieldnum
through the end of the line will be used.  Whenever more than one
field is used, the fields involved are concatenated together,
separated by the string specified with the -mf (multi-field) flag (see
below).  By default, this is the empty string.  Thus, using the
defaults (input fields separated by whitespace, multi-field joins
separated by nothing), a field specifier of "+2,3" applied to the line
X.br
X"Now is the winter of our discontent"
X.br
will cause the string "thewinterof" to be sent to the comparison operator.
X.TP
X.B -f \fIregexp\fR
Specify the field-separating regular expression, in
X.IR perl(1)
regular-expression syntax.  Slashes (/) in the expression must be
escaped with a backslash (\\), as must all other regular expression
metacharacters which are meant to be taken literally.
The default field-separator is '\\s+'
X(split into fields on whitespace).
X.TP
X.B -mf \fIstring\fR
Multi-field separator.  This is the separator that will be used when a
string is constructed from a sequence of fields, as with the
X[\-+]\fInum\fR,\fInum\fR construct (see above).
X.TP
X.B -irs \fIstring\fR
Input record separator.  By default, an input record is a line, and
lines are separated by a newline character.  However, you may specify
any string as the input record separator.  Note that, unlike \-f, this
option does not allow you to specify a regular expression.
X.TP
X.B -mb \fIbytes\fR
Limit the size of a chunk to
X.I bytes
bytes.  By default this is 8192.  A chunk is "full" when this limit is
reached,
X.I or
when the maximum-lines limit is reached, whichever comes first.  Use
X\-a instead of \-mb or \-ml if you want
X.I msort
to automagically determine suitable limits.  Overrides the
MSORTCHUNKBYTES environment variable.
X.TP
X.B \-ml \fIlines\fR
Limit the size of a chunk to
X.I lines
lines.  By default this is 256.  A chunk is "full" when this limit is
reached,
X.I or
when the maximum-bytes limit is reached, whichever comes first.  Use
X\-a instead of \-mb or \-ml if you want
X.I msort
to automagically determine suitable limits.  Overrides the
MSORTCHUNKLINES environment variable.
X.TP
X.B \-t \fIdir\fR
By default,
X.I msort
places its temporary files in /tmp.  Use this flag to specify an
alternate directory.  Overrides the MSORTTMPDIR environment variable.
X.TP
X.B \fIfile\fR ...
Names files from which to read input.  If no files are specified,
standard input is read.  If more than one file is specified, the
contents of all files named are sorted together.  The filename '\-'
refers to the standard input.
X.SH EXAMPLES
To apply a name sort to the first field, and a bibliographic sort to
all remaining fields:
X.br
X	msort -sname +0 -sbib +1,0 \fIfile\fR
X.br
To sort in reverse numeric order on the last field:
X.br
X	msort -snum -r -0 \fIfile\fR
X.br
To sort /etc/passwd according to users' last names (with a secondary
numeric sort on uid numbers):
X.br
X	msort -f: -sname +4 -snum +2 /etc/passwd
X.br
To treat the last three fields as a month, a day, and a year, and to
sort by date on the combination of these three fields:
X.br
X	msort -sdate -mf/ -2,0 \fIfile\fR
X.br
To apply the "foobar" sort, which you've defined in a file called
foobarlib:
X.br
X	msort -lfoobarlib -sfoobar \fIfile\fR
X.br
X.SH WRITING COMPARISON OPERATORS
To write an
X.I msort
comparison operator, you need to write a
X.IR perl(1)
subroutine.  You should create your own "package" so that your names
do not conflict with those in the
X.I msort
script.  However, the name of the subroutine should be in package "main."
The subroutine is passed two arguments in the @_ array, and should
return an integer less than, equal to, or greater than zero according
as the first argument is to be considered "smaller," "equal," or
X"larger" than the second.  The global variable $CaseMatters will be
true if your subroutine should treat upper- and lower-case as
distinct, and will be false otherwise.  If the user has selected a
X"reversed" sort with the \-r flag, your subroutine needn't do
anything; the arguments will simply be passed in in reverse order.
X.PP
Here is a sample package which implements a compare-by-length
subroutine.  According to this comparison operator, shorter strings come
before longer strings:
X.PP
X.nf
X	package bylength;
X
X	sub main'bylength {
X		length($_[0]) <=> length($_[1]);
X	}
X	1;
X.fi
X.SH ENVIRONMENT VARIABLES
X.TP
X.B MSORTCHUNKBYTES
The maximum number of bytes per chunk (like the \-mb flag).  This
value is ignored if \-a is specified.
X.TP
X.B MSORTCHUNKLINES
The maximum number of lines per chunk (like the \-ml flag).  This
value is ignored if \-a is specified.
X.TP
X.B MSORTCHUNKS
The maximum number of chunks (like the \-mc flag).  This value is
ignored
X.I unless
the \-a flag is specified.
X.TP
X.B MSORTTMPDIR
The directory in which to place temporary files, /tmp by default.
Works like the \-t flag.
X.TP
X.B MSORTPATH
A colon-separated list of directories in which to search for
X.IR perl(1)
files to load (either from the MSORTLIBS environment variable or with
the \-l flag).  Directories specified with \-I or \-L are appended to
the list of
directories specified with MSORTPATH.
X.TP
X.B MSORTLIBS
Colon-separated list of filenames.  These are taken to be
X.IR perl(1)
libraries to load before the sort begins.  Files specified with the
X\-l flag are loaded after the ones listed in MSORTLIBS.
X.SH FILES
X.TP
X.B /usr/include/sys/limits.h
One place where the definition of the per-process file-descriptor
limit might be found.
X.TP
X.B /usr/include/sys/param.h
Another place.
X.SH DIAGNOSTICS
X.TP
X.B cannot analyze standard input
The \-a flag was specified, and either no filenames were given on the
command line, or the special "\-" filename was given.
X.TP
X.B received signal...
A program-terminating signal was trapped and intermediate files are
being removed before
X.I msort
exits.
X.SH AUTHOR
Bob Glickstein, Information Technology Center, Carnegie Mellon
University
X.br
bobg+@andrew.cmu.edu
X.SH BUGS
There is no guarantee that you won't run out of file descriptors, even
if you use \-a or \-mc.
X.br
It's slow as molasses.
X.br
The mechanism for specifying a multi-field sort is crufty.
X.SH SEE ALSO
X.IR sort(1),
X.IR perl(1)
END_OF_FILE
if test 13723 -ne `wc -c <'msort.1'`; then
    echo shar: \"'msort.1'\" unpacked with wrong size!
fi
# end of 'msort.1'
fi
if test -f 'msort.pl' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'msort.pl'\"
else
echo shar: Extracting \"'msort.pl'\" \(15551 characters\)
sed "s/^X//" >'msort.pl' <<'END_OF_FILE'
X#!/usr/contributed/bin/perl
X
X### MSORT
X### A Better Sort of Sort
X### (c) 1991 by Bob Glickstein
X
X@_ = split(/\/+/, $0);
X$ProgramName = pop(@_);
X$Version = '1.0';
X
X$SIG{'HUP'} = 'interrupthandler';
X$SIG{'INT'} = 'interrupthandler';
X$SIG{'QUIT'} = 'interrupthandler';
X$SIG{'PIPE'} = 'interrupthandler';
X$SIG{'TERM'} = 'interrupthandler';
X
X$[ = 0;
X
X$Analyze = 0;
X$MaxChunkBytes = ($ENV{'MSORTCHUNKBYTES'} || 8192);
X$MaxChunkLines = ($ENV{'MSORTCHUNKLINES'} || 256);
X$MaxChunks = $ENV{'MSORTCHUNKS'};
X$FieldSepRegexp = '\s+';
X$MultiFieldSep = '';
X
X@SortList = ();
X
X$ChosenSort = 'ascii';
X$Reverse = 0;
X$CaseMatters = 1;
X
X$TmpDir = ($ENV{'MSORTTMPDIR'} || '/tmp');
X
X$Verbose = 0;
X
X$" = '';
X
X@MsortPath = ();
X@OrigINC = @INC;
X
if ($_ = $ENV{'MSORTPATH'}) {
X    @MsortPath = split(/:/, $_);
X    @INC = (@MsortPath, @OrigINC);
X}
X
if ($_ = $ENV{'MSORTLIBS'}) {
X    @libs = split(/:/, $_);
X    foreach $lib (@libs) {
X	require $lib || require "$lib.pl"
X	    || die "Cannot preload library $libname ($!)\n";
X    }
X}
X
X$Usage  = "[options] [--] [file] ...\n";
X$Usage .= "Options are:\n";
X$Usage .= "    -v[v[v...]]         Verbose\n";
X$Usage .= "    -a                  Analyze input\n";
X$Usage .= "    -mc chunks          Max chunks (with -a)\n";
X$Usage .= "    -I dir (or -L dir)  Add dir to library search path\n";
X$Usage .= "    -l lib              Load library lib\n";
X$Usage .= "    -s sortname         Select a sort (ascii,bib,num,name,date)\n";
X$Usage .= "    -r                  Reverse sort\n";
X$Usage .= "    -c, +c              Case-sensitivity off/on\n";
X$Usage .= "    [-+]num1[,num2]     Sort on num2 fields starting at num1\n";
X$Usage .= "    -f regexp           Field-separator\n";
X$Usage .= "    -mf string          Multi-field join delimiter\n";
X$Usage .= "    -irs string         Input record separator\n";
X$Usage .= "    -mb num             Max bytes per chunk\n";
X$Usage .= "    -ml num             Max lines per chunk\n";
X$Usage .= "    -t dir              Directory for temp files";
X
X$chosesort = 0;
X$chosefield = 0;
while ($_ = $ARGV[0], /^[-+]/) {
X    shift;
X    last if /^--$/;
X    if (/^-(v+)$/) {
X	$Verbose += length($1); 
X    } elsif (/^-a$/) {
X	$Analyze = 1;
X    } elsif (/^-[IL](.*)$/) {
X	push(@MsortPath, ($1 || shift));
X	@INC = (@MsortPath, @OrigINC);
X    } elsif (/^-l(.*)$/) {
X	$libname = ($1 || shift);
X	require $libname || require "$libname.pl"
X	    || die "Cannot load library $libname ($!)\n";
X    } elsif (/^-s(.*)$/) {
X	if ($chosesort && !$chosefield) {
X	    push(@SortList, 0, 0, 0, $ChosenSort, $Reverse, $CaseMatters);
X	}
X	$ChosenSort = ($1 || shift);
X	$Reverse = 0;
X	$CaseMatters = 1;
X	$chosesort = 1;
X	$chosefield = 0;
X	next;
X    } elsif (/^-r$/) {
X	$Reverse = 1 - $Reverse;
X    } elsif (/^-f(.*)$/) {
X	$FieldSepRegexp = ($1 || shift);
X    } elsif (/^-mf(.*)$/) {
X	$MultiFieldSep = ($1 || shift);
X    } elsif (/^-irs(.*)$/) {
X	$/ = ($1 || shift);
X    } elsif (/^-mb(.*)$/) {
X	$MaxChunkBytes = ($1 || shift);
X    } elsif (/^-ml(.*)$/) {
X	$MaxChunkLines = ($1 || shift);
X    } elsif (/^-mc(.*)$/) {
X	$MaxChunks = ($1 || shift);
X    } elsif (/^-t(.*)$/) {
X	$TmpDir = ($1 || shift);
X    } elsif (/^-c$/) {
X	$CaseMatters = 0;
X    } elsif (/^\+c$/) {
X	$CaseMatters = 1;
X    } elsif (/^-([0-9]+)(,([0-9]+))?$/) {
X	$chosefield = 1;
X	push(@SortList,
X	     1, $1, ($3 || 1), $ChosenSort, $Reverse, $CaseMatters);
X    } elsif (/^\+([0-9]+)(,([0-9]+))?$/) {
X	$chosefield = 1;
X	push(@SortList,
X	     0, $1, ($3 || 1), $ChosenSort, $Reverse, $CaseMatters);
X    } elsif (/^-V$/) {
X	print "$ProgramName version $Version\n";
X	exit 0;
X    } elsif (/^-h(elp)?$/) {
X	print "Usage: $ProgramName $Usage\n";
X	exit 0;
X    } else {
X	die "Unknown option '$_'\nUsage: $ProgramName $Usage\n";
X    }
X}
X
ANALYSIS: {
if ($Analyze) {
X    if (@ARGV) {
X	warn "[analyzing size of input]\n" if $Verbose;
X	$lines = 0;
X	$bytes = 0;
X	foreach $file (@ARGV) {
X	    warn "[analyzing $file...]\n" if $Verbose > 2;
X	    if ($file eq '-') {
X		warn "$ProgramName: cannot analyze standard input\n";
X		last ANALYSIS;	# Yuk yuk yuk
X	    }
X	    open(_, "<$file") || die "Cannot open input file $file ($!)\n";
X	    while (<_>) {
X		++ $lines;
X		$bytes += length($_);
X	    }
X	    close(_);
X	    warn "[analysis so far: $lines lines, $bytes bytes]\n"
X		if $Verbose > 1;
X	}
X	$maxfiles = &MaxDescriptors;
X	$maxfiles = $MaxChunks if ($MaxChunks && ($MaxChunks < $maxfiles));
X	$maxfiles -= 10;
X	$MaxChunkBytes = $bytes / $maxfiles;
X	$MaxChunkLines = $lines / $maxfiles;
X	$MaxChunkBytes = 1024 if $MaxChunkBytes <1024;
X	$MaxChunkLines = 32 if $MaxChunkLines < 32;
X	warn "[analysis complete, chose $MaxChunkLines/$MaxChunkBytes]\n"
X	    if $Verbose;
X    } else {
X	warn "$ProgramName: cannot analyze standard input\n";
X    }
X}
X}	
X
X@NamePrefixes = (
X		 'adm',
X		 'admiral',
X		 'ambassador',
X		 'archbishop',
X		 'archduke',
X		 'bishop',
X		 'capt',
X		 'captain',
X		 'cardinal',
X		 'cdr',
X		 'chief',
X		 'cmdr',
X		 'commander',
X		 'col',
X		 'colonel',
X		 'deputy',
X		 'det',
X		 'detective',
X		 'director',
X		 'dr',
X		 'duke',
X		 'eminence',
X		 'father',
X		 'fraulein',
X		 'friar',
X		 'gen',
X		 'general',
X		 'gov',
X		 'governor',
X		 'her',
X		 'herr',
X		 'highness',
X		 'his',
X		 'holy',
X		 'hon',
X		 'honorable',
X		 'judge',
X		 'lady',
X		 'lieutenant',
X		 'lord',
X		 'lt',
X		 'madame',
X		 'mademoiselle',
X		 'maj',
X		 'majesty',
X		 'major',
X		 'mayor',
X		 'miss',
X		 'mme',
X		 'mmle',	# Is this the right abbrev for mademoiselle?
X		 'monsieur',
X		 'monsignor',	# What's the abbreviation for this?
X		 'mr',
X		 'mrs',
X		 'ms',
X		 'officer',
X		 'pres',
X		 'president',
X		 'private',
X		 'pvt',
X		 'rabbi',
X		 'rear',
X		 'rev',
X		 'reverend',
X		 'royal',
X		 'senator',
X		 'senor',
X		 'senora',
X		 'senorita',
X		 'sergeant',
X		 'sgt',
X		 'signor',
X		 'signora',
X		 'signorina',
X		 'sir',
X		 'sr',
X		 'sra',
X		 'the'
X		 );
X
X@NameSuffixes = (
X		 'b',
X		 'd',
X		 'dds',
X		 'e',
X		 'esq',
X		 'esquire',
X		 'ii',
X		 'iii',
X		 'iv',
X		 'ix',
X		 'jr',
X		 'm',
X		 'md',
X		 'o',
X		 'obe',
X		 'ph',
X		 'phd',
X		 's',
X		 'sr',
X		 'the',
X		 'v',
X		 'vi',
X		 'vii',
X		 'viii',
X		 'x',
X		 'xi',
X		 'xii',
X		 'xiii',
X		 'xiv',
X		 'xv',
X		 'xvi',
X		 'xvii',
X		 'xviii'
X		 );
X
X
X@ChunkFiles = ();
X
X@chunk = ();
X$chunkbytes = 0;
X
while (<>) {
X    push(@chunk, $_);
X    $chunkbytes += length($_);
X    if (($chunkbytes >= $MaxChunkBytes)
X	|| (@chunk >= $MaxChunkLines)) {
X	printf(STDERR "[sorting chunk %d (%d/%d)]\n",
X	       1 + @ChunkFiles, 0 + @chunk, $chunkbytes) if $Verbose;
X	@chunk = sort SortPredicate @chunk;
X	$chunkfile = sprintf("$TmpDir/mschunk.$$.%d", 1 + @ChunkFiles);
X	open(_, ">$chunkfile") ||
X	    die "Cannot open temp file $chunkfile ($!)\n";
X	print _ @chunk;
X	close _;
X	printf(STDERR "[stored chunk %d in %s]\n", 1 + @ChunkFiles, $chunkfile)
X	    if $Verbose > 1;
X	push(@ChunkFiles, $chunkfile);
X	$chunkbytes = 0;
X	@chunk = ();
X    }
X}
if (@chunk) {
X    printf(STDERR "[sorting chunk %d (%d/%d)]\n",
X	   1 + @ChunkFiles, 0 + @chunk, $chunkbytes) if $Verbose;
X    @chunk = sort SortPredicate @chunk;
X    unless (@ChunkFiles) {
X	print @chunk, "\n";
X	exit 0;
X    }
X    $chunkfile = sprintf("$TmpDir/mschunk.$$.%d", 1 + @ChunkFiles);
X    open(_, ">$chunkfile") ||
X	die "Cannot open temp file $chunkfile ($!)\n";
X    print _ @chunk;
X    close _;
X    printf(STDERR "[stored chunk %d in %s]\n", 1 + @ChunkFiles, $chunkfile)
X	if $Verbose > 1;
X    push(@ChunkFiles, $chunkfile);
X    @chunk = ();
X} elsif (@ChunkFiles == 1) {
X    open(<_>, "<$ChunkFiles[0]")
X	|| die "Cannot read temp file $ChunkFiles[0] ($!)\n";
X    while (<_>) {
X	print;
X    }
X    close _;
X    unlink $ChunkFiles[0];
X    exit 0;
X}
X
X@ChunkHandles = ();
X
X$i = 1;
foreach $chunkfile (@ChunkFiles) {
X    $handlename = "CHUNK$i";
X    ++ $i;
X    open($handlename, "<$chunkfile")
X	|| die "Cannot read temp file $chunkfile ($!)\n";
X    push(@ChunkHandles, $handlename);
X}
X
X@Heads = ();
X
foreach $chunkhandle (@ChunkHandles) {
X    $_ = <$chunkhandle>;
X    push(@Heads, $_);
X}
X
while (@Heads) {
X    $best = $Heads[0];
X    $besti = 0;
X    foreach $i (1 .. $#Heads) {
X	if (&ComparePredicate($Heads[$i], $best) < 0) {
X	    $best = $Heads[$i];
X	    $besti = $i;
X	}
X    }
X    print $best;
X    $chunkhandle = $ChunkHandles[$besti];
X    $_ = <$chunkhandle>;
X    if ($_) {
X	splice(@Heads, $besti, 1, $_);
X    } else {
X	splice(@Heads, $besti, 1);
X	close($ChunkHandles[$besti]);
X	splice(@ChunkHandles, $besti, 1);
X	unlink $ChunkFiles[$besti];
X	splice(@ChunkFiles, $besti, 1);
X    }
X}
X
exit 0;
X
X##############################################################################
X
sub interrupthandler {
X    local($sig) = @_;
X
X    warn "$ProgramName: received signal $sig, cleaning up temp files...\n";
X    foreach (@ChunkFiles) {
X	unlink;
X    }
X    exit 1;
X}
X
sub SortPredicate {
X    &ComparePredicate($a, $b);
X}
X
sub ComparePredicate {
X    local($x, $y) = @_;
X    local(@x, @y);
X    local(@sorts) = @SortList;
X    local($xfield, $yfield);
X    local($xfrom, $yfrom);
X    local($result);
X
X    if (@sorts) {
X	@x = split(/$FieldSepRegexp/o, $x);
X	@y = split(/$FieldSepRegexp/o, $y);
X	while (@sorts) {
X	    ($fromend, $fieldnum, $numfields,
X	     $sortname, $Reverse, $CaseMatters) = @sorts;
X	    splice(@sorts, 0, 6);
X	    $xfrom = ($fromend ? ($#x - $fieldnum) : $fieldnum);
X	    $xfield = join($MultiFieldSep,
X			   @x[$xfrom .. ($numfields ?
X					 ($xfrom + $numfields - 1) :
X					 $#x)]);
X	    $yfrom = ($fromend ? ($#y - $fieldnum) : $fieldnum);
X	    $yfield = join($MultiFieldSep,
X			   @y[$yfrom .. ($numfields ?
X					 ($yfrom + $numfields - 1) :
X					 $#y)]);
X	    $result = $Reverse ?
X		&$sortname($yfield, $xfield) : &$sortname($xfield, $yfield);
X	    return $result if $result;
X	}
X	@x = @x[1 + ($fromend ? ($#x - $fieldnum) : $fieldnum) .. $#x];
X	@y = @y[1 + ($fromend ? ($#y - $fieldnum) : $fieldnum) .. $#y];
X	while (@x && @y) {
X	    $result = $Reverse ?
X		&$sortname($y[0], $x[0]) : &$sortname($x[0], $y[0]);
X	    return $result if $result;
X	    shift @x;
X	    shift @y;
X	}
X	if (@x) {
X	    return -1 if $Reverse;
X	    return 1;
X	}
X	if (@y) {
X	    return 1 if $Reverse;
X	    return -1;
X	}
X	return 0;
X    } else {
X	return &$ChosenSort($y, $x) if $Reverse;
X	return &$ChosenSort($x, $y);
X    }
X}
X
sub ascii {
X    local($x, $y) = @_;
X
X    return $x cmp $y if $CaseMatters;
X    $x =~ tr/A-Z/a-z/;
X    $y =~ tr/A-Z/a-z/;
X    $x cmp $y;
X}
X
sub num {
X    local($x, $y) = @_;
X
X    ($x < $y) ? -1 : (($y < $x) ? 1 : 0);
X}
X
sub bib {
X    local($x, $y) = @_;
X    local(@x_words,@y_words);
X
X    $x =~ tr/A-Za-z0-9 \t//cd;
X    $x =~ tr/A-Z/a-z/;
X    $y =~ tr/A-Za-z0-9 \t//cd;
X    $y =~ tr/A-Z/a-z/;
X
X    @x_words = split(/\s+/, $x);
X    @y_words = split(/\s+/, $y);
X
X    shift @x_words if (($x_words[0] eq 'the')
X		       || ($x_words[0] eq 'a')
X		       || ($x_words[0] eq 'an'));
X    shift @y_words if (($y_words[0] eq 'the')
X		       || ($y_words[0] eq 'a')
X		       || ($y_words[0] eq 'an'));
X    join('', @x_words) cmp join('', @y_words);
X}
X
sub name {
X    local($x, $y) = @_;
X    local(@x, @y);
X    local(@xprefixes) = ();
X    local(@xsuffixes) = ();
X    local(@yprefixes) = ();
X    local(@ysuffixes) = ();
X
X    $x =~ tr/.,/ /;
X    $y =~ tr/.,/ /;
X    $x =~ tr/A-Za-z \t//cd;
X    $x =~ tr/A-Z/a-z/;
X    $y =~ tr/A-Za-z \t//cd;
X    $y =~ tr/A-Z/a-z/;
X
X    @x = split(/\s+/, $x);
X    @y = split(/\s+/, $y);
X
X    while ((@x >= 3)
X	   && &AnyMatch($x[0], @NamePrefixes)) {
X	push(@xprefixes, shift(@x));
X    }
X    while ((@x >= 3)
X	   && &AnyMatch($x[$#x], @NameSuffixes)) {
X	unshift(@xsuffixes, pop(@x));
X    }
X    unshift(@x, pop(@x));
X    while ((@y >= 3)
X	   && &AnyMatch($y[0], @NamePrefixes)) {
X	push(@yprefixes, shift(@y));
X    }
X    while ((@y >= 3)
X	   && &AnyMatch($y[$#y], @NameSuffixes)) {
X	unshift(@xsuffixes, pop(@y));
X    }
X    unshift(@y, pop(@y));
X
X    join(' ', @x, @xsuffixes) cmp join(' ', @y, @ysuffixes);
X}
X
sub date {
X    local($x, $y) = @_;
X    local($xday, $xmonth, $xyear);
X    local($yday, $ymonth, $yyear);
X
X    $_ = $x;
X    s/^\s+//;
X    s/\s+$//;
X    tr/A-Z/a-z/;
X    if (/([0-9]+)[^0-9]+([0-9]+)[^0-9]+([0-9]+)/) {
X	if ($1 > 31) {
X	    $xday = $3;
X	    $xmonth = $2;
X	    $xyear = $1;
X	} elsif ($1 > 12) {
X	    $xday = $1;
X	    $xmonth = $2;
X	    $xyear = $3;
X	} else {
X	    $xday = $2;
X	    $xmonth = $1;
X	    $xyear = $3;
X	}
X    } elsif (/([a-z]+)[^0-9]*([0-9]+)[^0-9]*([0-9]+)/) {
X	$xday = $2;
X	$xmonth = $1;
X	$xyear = $3;
X    } elsif (/([0-9]+)[^0-9a-z]*([a-z]+)[^0-9]*([0-9]+)/) {
X	$xday = $1;
X	$xmonth = $2;
X	$xyear = $3;
X    } elsif (/([0-9]+)[^0-9]+([0-9]+)/) {
X	$xmonth = $1;
X	$xyear = $2;
X    } elsif (/([a-z]+)[^0-9a-z]*([0-9]+)/) {
X	$xmonth = $1;
X	$xyear = $2;
X    } elsif (/([0-9]+)/) {
X	$xyear = $1;
X    } else {
X	$xday = 0;
X	$xmonth = 0;
X	$xyear = 0;
X    }
X    if ($xyear < 100) {
X	$xyear += 1900;
X    }
X  xm: {
X      if ($xmonth =~ /[a-z]/) {
X	  $_ = $xmonth;
X	  /^ja/ && ($xmonth = 1, last xm);
X	  /^f/ && ($xmonth = 2, last xm);
X	  /^mar/ && ($xmonth = 3, last xm);
X	  /^ap/ && ($xmonth = 4, last xm);
X	  /^may/ && ($xmonth = 5, last xm);
X	  /^jun/ && ($xmonth = 6, last xm);
X	  /^jul/ && ($xmonth = 7, last xm);
X	  /^au/ && ($xmonth = 8, last xm);
X	  /^s/ && ($xmonth = 9, last xm);
X	  /^o/ && ($xmonth = 10, last xm);
X	  /^n/ && ($xmonth = 11, last xm);
X	  /^d/ && ($xmonth = 12, last xm);
X	  $xmonth = 0;
X      }
X  }
X    $_ = $y;
X    s/^\s+//;
X    s/\s+$//;
X    tr/A-Z/a-z/;
X    if (/([0-9]+)[^0-9]+([0-9]+)[^0-9]+([0-9]+)/) {
X	if ($1 > 31) {
X	    $yday = $3;
X	    $ymonth = $2;
X	    $yyear = $1;
X	} elsif ($1 > 12) {
X	    $yday = $1;
X	    $ymonth = $2;
X	    $yyear = $3;
X	} else {
X	    $yday = $2;
X	    $ymonth = $1;
X	    $yyear = $3;
X	}
X    } elsif (/([a-z]+)[^0-9]*([0-9]+)[^0-9]*([0-9]+)/) {
X	$yday = $2;
X	$ymonth = $1;
X	$yyear = $3;
X    } elsif (/([0-9]+)[^0-9a-z]*([a-z]+)[^0-9]*([0-9]+)/) {
X	$yday = $1;
X	$ymonth = $2;
X	$yyear = $3;
X    } elsif (/([0-9]+)[^0-9]+([0-9]+)/) {
X	$ymonth = $1;
X	$yyear = $2;
X    } elsif (/([a-z]+)[^0-9a-z]*([0-9]+)/) {
X	$ymonth = $1;
X	$yyear = $2;
X    } elsif (/([0-9]+)/) {
X	$yyear = $1;
X    } else {
X	$yday = 0;
X	$ymonth = 0;
X	$yyear = 0;
X    }
X    if ($yyear < 100) {
X	$yyear += 1900;
X    }
X  ym: {
X      if ($ymonth =~ /[a-z]/) {
X	  $_ = $ymonth;
X	  /^ja/ && ($ymonth = 1, last ym);
X	  /^f/ && ($ymonth = 2, last ym);
X	  /^mar/ && ($ymonth = 3, last ym);
X	  /^ap/ && ($ymonth = 4, last ym);
X	  /^may/ && ($ymonth = 5, last ym);
X	  /^jun/ && ($ymonth = 6, last ym);
X	  /^jul/ && ($ymonth = 7, last ym);
X	  /^au/ && ($ymonth = 8, last ym);
X	  /^s/ && ($ymonth = 9, last ym);
X	  /^o/ && ($ymonth = 10, last ym);
X	  /^n/ && ($ymonth = 11, last ym);
X	  /^d/ && ($ymonth = 12, last ym);
X	  $ymonth = 0;
X      }
X  }
X    ($xyear - $yyear) || ($xmonth - $ymonth) || ($xday - $yday);
X}
X
sub AnyMatch {
X    local($str, @strs) = @_;
X
X    foreach (@strs) {
X	($str eq $_) && return 1;
X    }
X    0;
X}
X
sub MaxDescriptors {
X    local($result) = '';
X
X    warn "[trying to determine per-process file-descriptor limit...]\n"
X	if $Verbose > 1;
X    warn "[looking for OPEN_MAX in /usr/include/sys/limits.h]\n"
X	if $Verbose > 2;
X    if (open(_, "/usr/include/sys/limits.h")) {
X	while (<_>) {
X	    if (/^#\s*define\s+OPEN_MAX\s+[^0-9]*([0-9]+)/) {
X		$result = $1;
X		close(_);
X		warn "[...looks like it's $result]\n"
X		if $Verbose > 1;
X		return ($result);
X	    }
X	}
X	close(_);
X    }
X    warn "[looking for NOFILE/NOFILES in /usr/include/sys/param.h]\n"
X	if $Verbose > 2;
X    if (open(_, "/usr/include/sys/param.h")) {
X	while (<_>) {
X	    if (/^\#\s*define\s+NOFILES?\s+[^0-9]*([0-9]+)/) {
X		$result = $1;
X		close(_);
X		warn "[...looks like it's $result]\n"
X		    if $Verbose > 1;
X		return ($result);
X	    }
X	}
X	close(_);
X    }
X    warn "[...guessing it's 16]\n" if $Verbose > 1;
X    16;
X}
END_OF_FILE
if test 15551 -ne `wc -c <'msort.pl'`; then
    echo shar: \"'msort.pl'\" unpacked with wrong size!
fi
chmod +x 'msort.pl'
# end of 'msort.pl'
fi
echo shar: End of shell archive.
exit 0

______________                  _____________________________
Bob Glickstein                | Internet: bobg@andrew.cmu.edu
Information Technology Center | Bitnet:   bobg%andrew@cmuccvma.bitnet
Carnegie Mellon University    | UUCP:     ...!harvard!andrew.cmu.edu!bobg
Pittsburgh, PA  15213-3890    |
(412) 268-6743                | Sinners can repent, but stupid is forever

exit 0 # Just in case...


