A SURAnet guide to the Archie service (v1.0.1)
by Eric Anderson

This document may be converted to another format for use on different machines, and aggregated with other files for distribution. However, please do not modify the content; sections of this document should not be removed or modified. Please address questions on this policy, requests for exceptions, comments, or suggestions to archie-admin@sura.net

This document is available from ftp.sura.net as pub/archie/docs/SURAnet-archie-guide.txt

Table of Contents
----- -- --------

1.0 Introduction
2.0 Data Stored In Archie System
3.0 Searching Through the Database
    3.1 Exact Searches
    3.2 Case Sensitive Searches
        3.2.1 Examples of Case Sensitive Searches
    3.3 Case Insensitive Searches
        3.3.1 Examples of Case Insensitive Searches
    3.4 Regular Expression Searches
        3.4.1 Examples of Regular Expression Searches
        3.4.2 Converting from File Matching Expressions to Archie Regexps
        3.4.3 Examples of Converted File Matching Expressions
4.0 Different Access Methods
    4.1 Remote Client Access
    4.2 Interactive Interfaces
    4.3 Mail Interface
Appendix A: Regular Expressions
    A.1 Overview
    A.2 Using Regular Expressions
    A.3 Matching an arbitrary character
    A.4 Matching repeated characters
    A.5 Matching strings at the front or back
    A.6 Matching a character out of a set of choices
    A.7 Forcing special characters to be normal
Appendix B: Glossary of Terms

1.0 Introduction
--- ------------

The Archie service provides a database of files available for anonymous ftp, a way of retrieving files that are publicly available on the Internet. Users can locate files by searching the Archie database, which is indexed by the names of the files. Therefore, users need to know a part of the file's name to locate a file using Archie.
This document describes:

  * The data in the Archie database, so that people can understand what queries will produce useful results;
  * The different search methods, with examples;
  * The different programs that can search the Archie database;
  * A description of Archie regular expressions.

2.0 Data Stored In The Archie System
--- ---- ------ -- --- ------ ------

The Archie system stores the names of files available for anonymous ftp. Therefore, searches of the database need to use the name of the file, not words associated with the file. For example, a search for the DOS version of the GNU C compiler requires knowing that the name of that package is djgpp.

The Archie system acquires the names of files stored at anonymous ftp sites by ftp-ing to each site and getting a listing of all the files stored there. As a result, files indexed in the database use the naming conventions of each individual site instead of a common naming convention.

To reduce the load on anonymous ftp sites, the Archie system retrieves the listings of files infrequently. Different Archie sites have different policies on how many sites they retrieve and how often they retrieve site listings, but usually about 50 listings are retrieved a night. Since there are over 1000 sites, each site is updated about once every 20 days. Furthermore, the different Archie sites are on different rotations for retrieving listings. Therefore, different Archie sites can return different results, and newly added files are unlikely to be indexed in Archie until a few days after they are made available.
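The update interval quoted above is simple division, and a small sketch makes it easy to redo for other figures. The function name and its parameters are illustrative only; the numbers are the round figures from this section, not measurements of any particular Archie site.

```python
import math


def nights_per_cycle(total_sites, listings_per_night):
    """Nights needed to retrieve a listing from every site once.

    Rounds up, because a partly used night still counts as a night.
    """
    return math.ceil(total_sites / listings_per_night)


# Round figures from the text: about 1000 sites, about 50 listings a night.
print(nights_per_cycle(1000, 50))  # prints 20
```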
3.0 Searching Through the Database
--- --------- ------- --- --------

There are four main ways of searching through the database:

  * Exact Searching (exact) -- Very fast, returns only exact filename matches
  * Case Sensitive Substring Searches (subcase) -- Medium fast, returns filenames with the substring in the filename
  * Case Insensitive Substring Searches (substring) -- Medium slow, returns filenames with the substring in any case in the filename
  * Regular Expressions (regexp) -- Slow, returns filenames which match the regular expression anywhere in the string

3.1 Exact Searching
--- ----- ---------

Exact searching is useful if the name of the desired program is known, and the user is looking for the closest copy. For example, a search for version 2.0.9 of xarchie using the string xarchie-2.0.9.tar.gz would find the file on 15 different hosts spread across the world. A search for the same version using the string xarchie-2.0.9.tar.Z would find the file on 25 different hosts.

3.2 Case sensitive substring searching (subcase)
--- ---- --------- --------- --------- ---------

Subcase searching is useful if part of the filename is known, but (for example) not the most current version. Subcase searching is more precise than substring searching if the case of the known part of the filename is also known; subcase searching will reduce the number of irrelevant matches. A search for the most recent version of the TeX package could use the search string TeX, which would find files which have the string TeX in them, but not those with only the string tex in them.

3.2.1 Examples of Case Sensitive Searches
----- -------- -- ---- --------- --------

Search String    Returned Names
------ ------    -------- -----
TeX-             TeX-3.14.tar.gz, SeeTeX-2.18.5.tar.Z, TeX-index,
                 GETTING-TeX-FILES, ...
gcc-2.5          gcc-2.5.0-2.5.0a.diff.gz, gcc-2.5.0.tar.gz,
                 gcc-2.5.2.tar.gz, gcc-2.5.0-cpp.ps.gz,
                 gcc-2.5.3.tar.gz, not.gcc-2.5.0.tar.gz,
                 gcc-2.5.0-2.5.1.diff, ...
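The three non-regexp search types listed in section 3.0 can be sketched as plain string tests. This is only an illustration of what each search type accepts, not a description of how the Archie server is implemented; the function names simply reuse the search type names, and the sample filenames are taken from the examples in this guide.

```python
def exact(query, name):
    """Exact search: the whole filename must equal the query."""
    return name == query


def subcase(query, name):
    """Case sensitive substring search: query appears as typed."""
    return query in name


def substring(query, name):
    """Case insensitive substring search: query appears in any case."""
    return query.lower() in name.lower()


names = ["TeX-3.14.tar.gz", "macros.tex", "README.TeX"]
for search in (exact, subcase, substring):
    hits = [n for n in names if search("TeX", n)]
    print(search.__name__, hits)
```

With these three names, the exact search finds nothing, the subcase search skips macros.tex, and only the substring search reports all three.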
3.3 Substring Searching
--- --------- ---------

Substring searching is useful if a portion of the filename is known, but not the most current version or the case of the filename. In the previous example of searching for TeX, a substring search would find filenames with either TeX or tex in them.

3.3.1 Examples of Case Insensitive Searches
----- -------- -- ---- ----------- --------

Search String    Returned Names
------ ------    -------- -----
tex              macros.tex, bib.tex, README.TeX, syd.tex, LNM.tex, ...
TeX              macros.tex, bib.tex, README.TeX, syd.tex, LNM.tex, ...
TeX-             revtex-30.hqx, videotex-terminal-tool.hqx, latex-style,
                 jtex-1.43-1.44.tar.gz, jtex-util.tar.Z,
                 psfig-tex-1.4.tar.Z, bibtex-style,
                 latex-style-misc.ann, ...
xf-2             Announcing-XF-2-2.z, xf-2.2.tar.gz,
                 announcing-xf-2-2, ...

3.4 Regular Expression Searches
--- ------- ---------- --------

Regular expressions allow specifying that the filename begin with a specific string, or specifying two sections of the filename separated by some unknown characters. Regular expressions are different from the file matching expressions used by the shell at the command line; a conversion process is provided in section 3.4.2. A complete description of Archie regular expressions is in Appendix A.

3.4.1 Examples of Regular Expression Searches
----- -------- -- ------- ---------- --------

Searching for complete distributions of emacs 19:

    ^emacs.*19[^-]*\.tar

This will find filenames starting with emacs, having 19 somewhere after that, and then any characters except a - until a .tar, for example:

    emacs-19.15.tar.gz, emacs-19.16.tar.gz, emacs-19.17.tar.gz, ...

Searching for diffs between versions of emacs 19:

    ^emacs.*19.*-.*\.tar

This will find filenames starting with emacs, having 19 somewhere after that, then a dash somewhere after that, and finally a .tar, for example:

    emacs-19.16-A.bin.tar.gz, emacs-19.16-alpha.src.tar.Z, emacs-19.15-A.el.2of2.tar.gz, emacs-19.15-A.bin.tar.gz, ...
Since the above expression didn't do what I expected, a more specific one:

    ^emacs.*19\.[0-9].*-.*[0-9][0-9]

This finds filenames starting with emacs, having 19. followed by a digit somewhere after that, then some number of characters, a dash, some more characters, and then two digits, which matches:

    emacs-19.12-19.13.diff.gz, emacs-19.11-19.12.diff.gz, emacs-19.10-19.11.diff.gz, ...

Searching for a tar version of a package:

    ^xce.*tar

This will find filenames starting with xce and having tar somewhere after that, for example xce-1.00.tar.Z, xcell.tar.z and xce.tar.gz

Searching for current sun bug fixes:

    [0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]\.tar

This will find filenames which contain six digits, then a dash, two digits, a period and then the string tar, for example 100075-09.tar.Z and 100149-03.tar.Z

Searching for the third major revision of xv:

    ^xv.*3.*tar

This will find filenames starting with xv, then some number of characters, the number three, some number of characters and then the string tar, for example xv-3.00.tar.gz, xv-2.21.386bsd.bin.tar.Z, and xview-3.part3.tar.gz

A better version of the above:

    ^xv-3\..*tar

This forces the first five characters to be xv-3., with the string tar at some point later. It finds xv-3.00a.tar.Z and xv-3.00.tar.Z

3.4.2 Converting from File Matching Expressions to Archie Regexps
----- ---------- ---- ---- -------- ----------- -- ------ -------

File matching expressions, or file globs, are typed in to the shell or command line to match filenames. Common examples are *.txt, *.exe, x*, p*.zip, etc. The following rules should convert file globs to Archie regexps. An explanation of what each symbol does can be found in Appendix A.

  1. Prepend ^ to the file glob, and append $ to the end
  2. Replace all occurrences of . with \.
  3. Replace all single character matches, such as ? with .
  4. Replace all multiple character matches, e.g. * with .*

Note that regular expression matches are case sensitive.
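The four rules above can be sketched as a short function. This is a minimal illustration, not a tool that ships with Archie: the name glob_to_archie_regexp is made up for this example, and the sketch assumes a simple glob that uses only ordinary characters, the period, ? and *. Globs containing other characters that are special in Archie regexps (^ $ \ [ ]) are not handled.

```python
def glob_to_archie_regexp(glob):
    """Convert a simple file glob to an Archie regexp using rules 1-4."""
    pieces = []
    for ch in glob:
        if ch == ".":
            pieces.append("\\.")  # rule 2: a literal period must be escaped
        elif ch == "?":
            pieces.append(".")    # rule 3: single character match
        elif ch == "*":
            pieces.append(".*")   # rule 4: multiple character match
        else:
            pieces.append(ch)     # ordinary characters are copied unchanged
    # rule 1: anchor the result to the front and back of the filename
    return "^" + "".join(pieces) + "$"


for glob in ("*.txt", "x*", "p*.zip"):
    print(glob, "->", glob_to_archie_regexp(glob))
```

The conversion is done in a single pass over the glob so that the period produced by rule 3 is never escaped again by rule 2.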
If arbitrary cases for each letter should match, then each letter needs to be replaced with the upper and lower case versions of that letter in brackets, e.g. replace b with [bB]

If a command line interface is used for searching, then single quotes (') probably need to surround the regular expression argument.

If neither the single nor the multiple character matches have been used, then the search can probably be performed using a substring or subcase search, which should be faster than the regular expression search.

3.4.3 Examples of Converted File Matching Expressions
----- -------- -- --------- ---- -------- -----------

File Expression    Archie Regexp
---- ----------    ------ ------
*.txt              ^.*\.txt$
*.exe              ^.*\.exe$
x*                 ^x.*$
p*.zip             ^p.*\.zip$

4.0 Different access methods
--- --------- ------ -------

The Archie database resides on a number of different machines on the Internet. A server on each machine allows users to query the database from their own machines. Client programs are programs which contact the server to search the database. By separating the client from the server, the load on the server is reduced, allowing faster processing of the searches.

4.1 Remote Client Access
--- ------ ------ ------

Remote client access is the preferred method to access the database. Since the Archie machine need only process the query, it has to do much less work than if the telnet or mail interfaces are used. This is especially true for version 2.X of the Archie system, which handles interactive and mail queries very inefficiently.

All version numbers and locations are the best I know of as of Nov 14, 1993. Archie can be used to locate newer versions, or to find copies which are stored closer to your site. As the clients are updated, the version numbers will change. Therefore, after finding where copies of the programs are stored, check to make sure that the most recent version is retrieved.
There are two main programs which provide client access, the c-archie client and the xarchie program. There is a version of the c-archie client which has been compiled for VMS machines.

The c-archie client can be retrieved from ftp.sura.net as pub/archie/clients/c-archie-1.4.1-FIXED.tar.Z

The VMS version of the c-archie client can be retrieved from ftp.sura.net as /pub/archie/clients/c-archie-1.3.2-vms.com

The xarchie client can be retrieved from ftp.x.org as contrib/xarchie-2.0.9.tar.Z

There is also a perl archie client, available from ftp.sura.net as pub/archie/clients/perl-archie-3.8.tar.Z

A NeXT archie client can be retrieved from ashley.cs.widener.edu as pub/archie/archie-NeXT.tar.Z

A mac archie client is available from mac.archive.umich.edu as /mac/util/comm/anarchie1.00.sit.hqx

There appears to be a mac client available from pprg.eece.unm.edu as /pub/Mac/sumex/comm

4.2 Interactive Interfaces
--- ----------- ----------

The interactive interface should be used only if the remote clients are unavailable. Users can telnet to any Archie site and log in as archie. On the host archie.sura.net, users can log in as qarchie, which provides a faster interactive interface.

4.3 Mail Interface
--- ---- ---------

Mail clients should be used as a last resort. By sending e-mail to archie@ with the word help in the body or subject, users can receive a file which explains how to use the mail interface to Archie.

Appendix A: Regular Expressions
-------- -- ------- -----------

A.1 Overview
--- --------

Regular expressions are very powerful ways of describing filenames. They allow the following features:

  * Forcing certain characters to be repeated exactly n times, or at least n times, where n >= 0
  * Specifying that a string should occur at the front or back of a matched filename
  * Specifying sets of characters which can match

The following characters are special in Archie regular expressions: . ^ * $ \ [ ]

The regular expressions used in Archie are known as ed regular expressions because they were derived from the ed editor, which is found on UNIX workstations. ed regular expressions are different from the file matching expressions that are used at the command line. Section 3.4.2 describes how to convert file matching expressions to Archie regexps.

A.2 Using Regular Expressions
--- ----- ------- -----------

Individually, many of these capabilities are not useful or can be better handled as another type of search. However, in combination they can accurately specify the set of names desired, lowering the number of useless matches. I usually think about a regular expression as a series of pieces of a filename stuck together, which helps me understand what filenames will be matched.

A.3 Matching an arbitrary character
--- -------- -- --------- ---------

To match an arbitrary character, the . symbol is used. For example, the regular expression ........ will match all filenames which have at least eight characters in them, because Archie regular expressions match any part of a filename.

A.4 Matching something repeated multiple times
--- -------- --------- -------- -------- -----

A large amount of the power in regular expressions comes from the ability to match a specified set of characters repeated an arbitrary number of times. This is more powerful than file globs, which cannot specify a set. The * character means that the previous element in a regular expression should be matched an arbitrary number of times, including zero. For example, to match something which has ctwm and tar in the filename, separated by some set of arbitrary characters, the regular expression ctwm.*tar would be used.

To find the versions of c2man which were archived from one of the comp.sources groups, the regular expression c2man.*[0-9][0-9]* would be used. It would find all filenames with c2man in the string, followed by some or no characters, followed by at least one digit.
This search would match filenames like c2man-2.0.13.tar.gz, c2man-2.03, c2man-1.10.tar.Z

A.5 Anchoring strings at the front or the back
--- --------- ------- -- --- ----- -- --- ----

To anchor a particular string to the front or the back of a matching filename, a ^ is put at the front or a $ at the back. For example, to find zip files, the search string zip$ would be used. To find filenames starting with gdb, the string ^gdb would be used. These options can be combined: the search string ^foo$ would match a file which has precisely the name foo. This particular search would be better as an exact search because the results would be returned faster; however, a search for compressed tar files of gcc, using the regexp ^gcc.*tar\.Z$, would be a good use of the power of regular expressions. In the earlier example of ........ , filenames of exactly eight characters could be matched using ^........$

A.6 Matching a character out of a set of possibilities
--- -------- - --------- --- -- - --- -- -------------

There are two ways to specify characters to be matched: first as a list of possibilities, and second as a list of unacceptable possibilities. To match any digit, use the regular expression [0-9], which says match any character from 0 to 9. The expression [a-zA-Z] would match any alphabetic character, and the expression [aeiou] would match any lower case vowel.

To match characters not in a set, the ^ character is placed after the left bracket. To find a file whose name contains tar, then some characters, then a non-alphabetic character, some more characters and then tar again, the regular expression tar.*[^a-zA-Z].*tar could be used.

A.7 Forcing special characters to be normal
--- ------- ------- ---------- -- -- ------

The characters . ^ * $ \ [ ] have special meanings in Archie regular expressions. To specify that one of these characters should be matched exactly, the special character needs to be escaped by putting the \ character in front of it.
For example, the expression \.tar$ would find filenames ending in .tar, but the expression .tar$ could find filenames ending in ftar.

Appendix B: Glossary
-------- -- --------

anonymous ftp
    Anonymous ftp is a way of accessing publicly available files. Normally you would use the ftp command with the user name anonymous. It is customary to give your e-mail address as the password so that people will know who is retrieving files; indeed, some sites require a valid e-mail address before allowing you to retrieve files.

anchoring
    Anchoring means forcing a section of a filename to match at either the front or the back of the filename, i.e. the string is anchored to the front (or back) of the filename.

case
    Upper (A-Z) or lower (a-z) case. Case insensitive searches consider A to be equivalent to a, B to b, etc.

client
    A program which runs on one machine and accesses a server to gather some form of information.

command line
    Where commands are typed in at the prompt. Usually the program that receives the commands is called the shell.

compress
    A program which removes redundancy from a file, producing a shorter output file.

diffs
    Differences. When a package is upgraded to a new version, authors usually provide the differences between the old and new versions of the package because the differences are smaller, and because by applying the differences to their own copy of the sources, users are not forced to re-make any local changes.

ed
    One of the first editors made available on unix workstations. The regular expressions used in ed are very similar to the ones used in Archie.

e-mail
    Electronic Mail. A common way of communicating across the Internet.

file globs
    See file matching expressions.

file matching expressions (file globs)
    Expressions that are typed into the shell at the command line to specify files. File globs are useful for specifying multiple files.

host
    See site.

Internet
    The Internet is the collection of hosts connected through the NSFnet backbone, which started as a DARPA project. The Internet now reaches sites across the world.

package
    A collection of files that make up a program. Usually the sources to a particular program, but the parts of a package can include data files, binaries, etc.

regular expressions
    A name for expressions which can be matched by a finite automaton, a "machine" with a finite number of states which can change from one state to another given a single character of input.

regexps
    See regular expressions.

shell
    A program which parses file globs and executes programs. Usually a shell has other features such as input/output redirection, repeating, job control, etc. See also command line.

server
    The part of a server-client system which receives requests from the client, processes those requests and returns the results to the client. There is an Archie server running on the machines which provide Archie service.

subcase
    Case sensitive substring matches, which means that the filename matched must have the same case as the search string. See also substring and case.

substring
    Case insensitive substring matches, which means that the filename matched can have any case relative to the search string provided that the letters are the same. See also subcase and case.

site
    Usually a machine on the Internet, for example the anonymous ftp site ftp.sura.net. Sometimes generalized to mean a group of machines, for example the Carnegie Mellon site.

tar
    A program which gathers a collection of files together into one file for transmission or storage. tar preserves the names and subdirectories of the gathered files.