Internet Draft (SHAVE) C. Adie
Expires April 1994 Edinburgh University
Filename draft-adie-shave-00.txt 12 October, 1993
SGML-based Hierarchical Attribute/Value Encoding (SHAVE)
STATUS OF THIS DOCUMENT
This document is an Internet Draft. Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its
Areas and its Working Groups. (Note that other groups may also
distribute working documents as Internet Drafts.)
Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted
by other documents at any time. It is not appropriate to use
Internet Drafts as reference material or to cite them other than
as a "working draft" or as "work in progress". To learn the
current status of any Internet Draft, please check the
1id-abstracts.txt listing contained in the Internet Drafts Shadow
Directories on ds.internic.net, nic.nordu.net, ftp.nisc.sri.com,
or munnari.oz.au.
SUMMARY
The usefulness of attribute/value pairs for conveying information
is well established. There is a need for a standard text-based
method of representing attribute/value data, which is capable of
being easily written and read by humans, and also easily
processed by a computer program. Often, such data is required to
be transferred in electronic mail messages. This document
describes how SGML (Standard Generalized Markup Language) can be
used as the basis for such a representation.
Table of Contents
1. Introduction
1.1 Requirements
1.2 SGML
1.3 The SHAVE Approach
2. General SGML Environment
3. DTD Restrictions
4. Document Instance Restrictions
5. Examples
6. References
7. Security Considerations
8. Acknowledgements
9. Contact
1. Introduction
The use of attribute/value pairs for conveying information is
well established. Although ASN.1 is often used to transmit
attribute/value information (eg in X.500), it is not a human
readable representation.
There is a need for a standard text-based method of representing
attribute/value data, which is capable of being easily written
and read by humans, and also easily processed by a computer
program. Often, such data is required to be transferred in
electronic mail messages.
Typical applications for such a representation are:
. Exchange of personal contact information (such as might be
shown on a business card). A recipient might process this
information using a database program.
. Exchange of meta-information concerning a resource (eg a file
or a service) on the Internet.
. Dissemination of information regarding an event (a meeting,
lecture, conference etc.). A recipient might process this
information using a personal time manager or simple diary
program.
. As a format for an electronic mail "form", for form-filling
applications.
This document describes how SGML (Standard Generalized Markup
Language) can be used as the basis for such a representation.
This document draws heavily on the work done by Dave Crocker on
STIF [Crocker 93a] and PCI [Crocker 93b].
1.1 Requirements
The requirements for an attribute/value text format are as
follows, roughly in order of importance.
1. Must be easy to read by people who are not computer experts.
2. Must be easy to write by non-experts, using very simple
written instructions.
3. Must be easy to write a parser for.
4. Must be capable of handling multi-valued attributes, where
there may be significance in the ordering of values.
5. Must be capable of handling nested attribute/value pairs - ie
attributes whose values are attribute/value pairs.
6. Must be capable of handling non-text attribute values by
reference.
7. Must be able to handle attribute values which are not in US
ASCII.
8. It must be easy for a computer program to identify that part
of a text file which contains attribute/value information
relevant to it.
1.2 SGML
Why use SGML to achieve the above aims? There are several
reasons:
. It is the obvious tool for the job, being designed explicitly
for structuring text and offering a great deal of flexibility.
. It is supported by a range of existing commercial and public
domain tools.
. It is easy for non-expert users to read and write SGML.
. It lends itself to hierarchically-structured data.
. It has very few "special characters" which require escaping in
normal text.
To learn about SGML, there's a good little booklet called "The
SGML Primer", written by SoftQuad [SoftQuad 91], which offers a
rapid walk through the main features of the language. The
remainder of this document assumes you have read this booklet (or
have an equivalent level of SGML knowledge).
The SGML "bible" is The SGML Handbook [Goldfarb 90], which
contains the complete text of the ISO standard, as well as a
great deal of excellent explanation.
1.3 The SHAVE Approach
The main problem with representing attribute/value information
using SGML is one of nomenclature - the word "attribute" has a
specific meaning in SGML, which differs from the meaning we've
given it so far. To resolve the conflict, in the rest of this
document we will use the term "attribute" in its SGML meaning,
and speak about "parameter/value" data instead of about
"attribute/value" data.
It is envisaged that such SGML-encoded parameter/value data will
be parsed by two kinds of programs:
. Programs which are based on generic SGML parsers, which base
their parsing on an externally-stored SGML Document Type
Definition (DTD).
. Programs which do not understand SGML or DTDs per se, and are
specific to a particular application. Such programs might be
written in awk or Perl, or any other programming language.
In order to make it easier to write programs of the second kind
("naive parsers"), we will introduce restrictions which limit the
way in which an SGML document and DTD may be written. This
document therefore describes some general rules (called "the
SHAVE rules") for representing parameter/value data in SGML. It
does not define a DTD, but describes restrictions which govern:
. how an application should define its own DTD.
. how a document instance should be written, over and
above the rules imposed by the DTD and by SGML.
The remainder of this document is structured as follows. Chapter
2 describes assumptions about the general SGML environment.
Chapter 3 lists rules which restrict how DTDs may be written.
Chapter 4 describes restrictions on document instances. Chapter
5 (which is not formally part of the specification) gives
examples of how the restrictions work.
For ease of identification and reference,
0. SHAVE rules are numbered, indented and printed in italics,
like this.
Following each rule is an explanation and/or justification of the
rule, if required.
2. General SGML Environment
1. The SGML Reference Concrete Syntax shall be used, with the
modification that the syntax-reference character set shall
be ANSI X3.4 instead of ISO646 IRV.
This rule implies that the SGML Reference Delimiter set is used
(see Goldfarb p477). Among other things, it also implies that the
alphabetic case of element names is not significant, but the case
of entity names is significant.
2. The SGML Reference Quantity Set shall apply.
The Reference Quantity Set defines various limits which are
usually left to the discretion of an implementor - see Goldfarb
p470. This rule ensures that a SHAVE document may be parsed by
any generic SGML parser without having to worry about the
parser's implementation limits. The most important restriction
is on the maximum length of SGML element and attribute names,
which must not exceed 8 characters.
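As an illustration of this limit, the following Python sketch checks a
candidate element or attribute name. It assumes the Reference Concrete
Syntax naming rules (a letter followed by letters, digits, full stops
or hyphens) and the Reference Quantity Set value NAMELEN=8. The
function name is_valid_name is ours and is not part of SHAVE.

```python
import re

NAMELEN = 8  # name length limit assumed from the SGML Reference Quantity Set

# Assumed naming rule: a letter, then letters, digits, "." or "-".
_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` could serve as an element or attribute name."""
    return len(name) <= NAMELEN and _NAME_RE.fullmatch(name) is not None
```

A DTD designer could run such a check over every proposed parameter
name before committing to it; "computer" passes, while "telephone" is
one character too long.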
3. DTD Restrictions
A "SHAVE application" is the definition of a DTD and associated
semantics, together with program(s) which parse conforming
document instances. This Chapter defines restrictions with which
a SHAVE application's DTD must comply.
3. A parameter shall be represented as an SGML element.
This is the key rule for understanding how SGML represents
parameter/value information. The name of the parameter is the
element name, and the value of the parameter is the element
content.
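To show how a naive parser might exploit this rule, here is a minimal
Python sketch which extracts parameter/value pairs from elements whose
content is plain text. The element names "name" and "title" are
hypothetical, and the sketch ignores SGML attributes, nested elements
and entity references. Each tag is placed at the start of a line, as
Rule 16 requires.

```python
import re

# <name> ... </name>, where the content holds no further markup.
# White space is allowed before the closing ">" of either tag.
_SIMPLE = re.compile(
    r"<([A-Za-z][A-Za-z0-9.\-]*)\s*>([^<]*)</\1\s*>",
    re.IGNORECASE,
)


def simple_parameters(text):
    """Return (parameter, value) pairs for simple text-valued elements."""
    return [(m.group(1).lower(), m.group(2).strip())
            for m in _SIMPLE.finditer(text)]


sample = """
<name>
John Smith
</name>
<title>
Leader of the Opposition
</title>
"""
print(simple_parameters(sample))
```

Element names are folded to lower case because their alphabetic case
is not significant under Rule 1.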
4. A parameter which takes a single text value shall have a
content model of (#PCDATA).
5. A parameter whose value is a set or sequence of other
parameter/value pairs will have a content model which
defines the parameters which may occur within it.
These rules simply state the obvious modelling of hierarchical
parameter/value data by nested SGML elements.
6. A parameter which takes a list of text values shall have a
content model similar to (item+), where the item element is
defined as:
The declaration of item (or a similar element) must be included
in the DTD. The following Chapter describes restrictions on
document instances relating to the use of this form of element.
7. If the value of an element contains both (a) a text portion,
and (b) subsidiary parameter/value information, then the
content model of the corresponding element shall specify
that the #PCDATA representing (a) precedes the other
elements which represent (b).
DTD designers should avoid specifying such content models -
elements preferably should contain either other elements or
#PCDATA, not both. However, there may be cases where both are
needed in a single element, and in such cases the #PCDATA should
come first in order to aid legibility.
8. The following SGML entity declarations shall be included in
an application DTD:
Any octets (subject to the constraints of the character set in
effect) are permitted in the #PCDATA content of an element,
except for octets with decimal values 38 (the ampersand) and 60
(the less-than sign). These must be represented by entity
references of the form &amp; and &lt; respectively.
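The substitution described above can be sketched in Python as a pair
of helper functions. The names escape_pcdata and unescape_pcdata are
illustrative only. The sketch works on text in the default character
set and handles just the two entities required by this rule.

```python
def escape_pcdata(text: str) -> str:
    """Prepare a parameter value for inclusion as #PCDATA."""
    # Escape "&" first, so the "&" introduced for "<" is not escaped twice.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def unescape_pcdata(text: str) -> str:
    """Recover the parameter value from #PCDATA."""
    # Resolve "&lt;" first, so an escaped "&amp;lt;" yields the text "&lt;".
    return text.replace("&lt;", "<").replace("&amp;", "&")
```

The order of the two replacements matters in both directions; reversing
either would corrupt values which already contain the text of an entity
reference.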
9. Applications which define parameters taking values which are
stored externally to the document instance, shall do so
using elements with a content notation defined in the
application's DTD.
An application program which processes the document instance may
then choose to resolve the reference by retrieving the referenced
value. This approach has been chosen rather than the "external
entity" feature of SGML in order to meet criteria 1 and 2 of
section 1.1 above.
10. By default, #PCDATA within elements comprises
characters from ANSI X3.4. Applications which wish to allow
the use of alternative character sets shall provide an
optional attribute cset for each element, which takes a
value which uniquely identifies the character set used in
#PCDATA within that element and all included elements which
do not specify their own cset attribute.
The default character set for #PCDATA is US ASCII. All markup is
done in that character set. Elements inherit cset values from
their enclosing elements - unfortunately SGML does not provide a
keyword to express this.
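Since the inheritance has to be implemented by the application, a
sketch may help. The Python function below takes the cset values found
on the chain of elements enclosing a piece of #PCDATA, outermost
first, with None where the attribute was omitted. The identifier used
for the default character set is an assumption, as no list of
character set names has yet been specified.

```python
DEFAULT_CSET = "ANSI_X3.4"  # hypothetical identifier for the default set


def effective_cset(own_csets):
    """Return the character set in force for the innermost element.

    own_csets lists the cset attribute of each enclosing element,
    outermost first; None marks an element with no cset attribute.
    """
    cset = DEFAULT_CSET
    for value in own_csets:
        if value is not None:
            cset = value  # a nearer element overrides its ancestors
    return cset
```

For example, an element with no cset attribute nested inside one which
specifies a cset takes the value of the enclosing element.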
We need to specify a list of character set names which can be
used as values of cset attributes.
11. Where an SGML attribute is specified in the DTD as
taking one of a fixed number of values, the values shall be
distinct from values of all other attributes of the same
element type.
SGML attributes may be used to qualify the content of an element.
This rule allows the attribute name to be omitted in document
instances for certain types of attribute, as described in the
following Chapter. SGML attributes which take CDATA values do
not fall into this category.
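A DTD designer might check this rule mechanically. The following
Python sketch reports any value which is declared for more than one
attribute of the same element type. The attribute names and values in
the usage note are invented for illustration, and the sketch assumes
that values are compared without regard to alphabetic case.

```python
def shared_values(enumerated):
    """Return the set of values declared for more than one attribute.

    enumerated maps each attribute name of one element type to the
    fixed values that the attribute may take.
    """
    owner = {}
    clashes = set()
    for attr, values in enumerated.items():
        for value in values:
            key = value.lower()
            if key in owner and owner[key] != attr:
                clashes.add(key)
            owner.setdefault(key, attr)
    return clashes
```

An empty result means the element type satisfies the rule; for
instance, shared_values({"type": ["internet", "x400"], "place":
["home", "work"]}) returns an empty set.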
12. Application designers who wish to allow the use of
experimental parameters shall define the following element
and attribute in their DTD:
Individuals may then use the X element with a type of their
choice to contain experimental values. Note two points in
particular:
. The end tag is required.
. SGML requires that no other end tag may occur within an X
element.
13. If a MIME registration is required for a SHAVE format,
the registered MIME subtype shall be used in the DTD as the
document type name.
The top-level MIME content type will probably be either "text"
(if the data can usefully be displayed in text form) or
"application" (otherwise). Applying for registration is the
responsibility of the application designer(s). Note that SHAVE
does not itself have or require a MIME registration.
4. Document Instance Restrictions
The rules in this Chapter restrict how markup occurs. The
purpose of these restrictions is twofold - to make it easier for
a human to read the document, and to make it easier to write a
parser which is not fully SGML-aware.
14. Comments in a document instance shall start with the
four-character sequence <!-- and end with the three-character
sequence -->.
This restriction prohibits spaces from occurring between the --
and the > at the end of the comment (this is normally allowed in
SGML). The empty comment <!> and comments within tags are
excluded by this rule. Comments are discarded by the parser.
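Because the comment delimiters are fixed by this rule, a naive parser
can discard comments with a single pattern. The Python sketch below
assumes that a comment opens with <!-- and closes with the first
following -->; the function name strip_comments is ours.

```python
import re

# Non-greedy match from "<!--" to the first "-->"; DOTALL lets a
# comment span several lines.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(text: str) -> str:
    """Remove every comment from a document instance."""
    return _COMMENT.sub("", text)
```

Running this step before any other processing means that later stages
never have to consider comments at all.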
15. Start tags shall not be omitted, and tags shall not be
shortened, except where otherwise specified in these rules.
16. When a tag occurs in a document, its opening delimiter
(less-than sign) must occur as the first non-blank character
on the line.
These rules make things much easier for non-SGML-aware programs.
Note that the element name immediately follows the opening
delimiter, but the closing delimiter (greater-than sign) may be
preceded by white space.
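The benefit for naive parsers can be seen in the following Python
sketch, which classifies one line of a document instance at a time. It
assumes that comments have already been removed, and it returns any
attribute text unexamined. The function name classify and the shape of
its result are illustrative choices, not part of SHAVE.

```python
import re

# Optional leading white space, "<", optional "/", optional element
# name, anything up to ">", then the rest of the line.
_TAG_LINE = re.compile(r"\s*<(/?)([A-Za-z][A-Za-z0-9.\-]*)?([^>]*)>(.*)")


def classify(line: str):
    """Return (kind, element name, text) for one line.

    kind is "start", "end" or "data"; the name is empty for data lines
    and for the empty tag <>.
    """
    m = _TAG_LINE.fullmatch(line)
    if m is None:
        return ("data", "", line.strip())
    slash, name, _attrs, rest = m.groups()
    kind = "end" if slash else "start"
    return (kind, (name or "").lower(), rest.strip())
```

A line-oriented tool such as this needs no lookahead: whether a line
carries a tag is decided by its first non-blank character alone.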
17. Within an element which contains #PCDATA, white space
occurring before the first non-white character and white
space occurring between the last non-white character and the
opening delimiter of the tag which closes the element, shall
be discarded by the parser.
This rule ensures that leading spaces, newlines etc, introduced
for legibility at the beginning and end of an element are
ignored. However, white space within the text of an element is
not ignored by the parser and may be treated as significant by
the application.
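In Python, this trimming corresponds closely to the built-in
str.strip() method, as the following sketch shows. The function name
element_text is illustrative, and the sketch assumes the raw content
of the element has already been isolated from its tags.

```python
def element_text(raw: str) -> str:
    """Apply Rule 17 to the raw #PCDATA content of an element."""
    # strip() removes white space at both ends only; spaces and
    # newlines inside the text are left untouched.
    return raw.strip()


raw = "\n    Weave a circle round him thrice\n    And close your eyes with holy dread\n"
print(repr(element_text(raw)))
```

The newline and indentation between the two lines of verse survive,
while the white space introduced around them for layout is dropped.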
18. Where an SGML attribute is specified in the DTD as
taking one of a fixed number of values which are distinct
from values of all other attributes of the same element
type, the attribute name shall be omitted in document
instances.
This rule refers to the "Attribute Minimisation" feature of SGML,
as described in the ISO standard Annex C.1.1.3 (Goldfarb p70),
which permits the omission of the attribute name. The above
SHAVE rule makes this omission mandatory.
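A parser which reads such minimised start tags has to work out which
attribute each bare value belongs to. The Python sketch below does so
from a table of the fixed values declared in the DTD, relying on Rule
11 to guarantee that each value has exactly one owner. The function
name and the example attribute are hypothetical.

```python
def expand_minimised(tokens, enumerated):
    """Map bare attribute values from a start tag to attribute names.

    tokens is the list of bare values found in the tag; enumerated
    maps each attribute name to the fixed values it may take.
    """
    owner = {value.lower(): attr
             for attr, values in enumerated.items()
             for value in values}
    expanded = {}
    for token in tokens:
        key = token.lower()
        if key not in owner:
            raise ValueError(f"{token!r} is not a declared attribute value")
        expanded[owner[key]] = key
    return expanded
```

For instance, given an attribute "type" declared with the values
"internet" and "x400", the bare token "internet" expands to the pair
type=internet.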
19. Where an element has a content model of the form
(item+) to represent a list of values, and each item within
the element contains only #PCDATA, the following
minimisations shall occur within the enclosing element: (a)
All item end tags shall be omitted. (b) The first item
start tag after the start tag of the enclosing element shall
be omitted. (c) Other occurrences of the item start tag
shall be shortened to the empty tag <>.
This rule allows the pair of characters <> to be used as
separators in a list of values in some common circumstances (see
Goldfarb p75).
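For a naive parser, the practical effect is that the content of such
an element can be split on the empty tag. A minimal Python sketch,
assuming the content between the enclosing start and end tags has
already been extracted and that every item contains only #PCDATA:

```python
def list_items(content: str):
    """Split the content of a list-valued element into its items."""
    # Each "<>" separates two items; surrounding white space is
    # dropped from each item, in line with Rule 17.
    return [item.strip() for item in content.split("<>")]


content = "\nApple Macintosh\n<> IBM PC\n<> Sun Sparcstation\n"
print(list_items(content))
```

The function name list_items is ours. Entity references within each
item would still need to be resolved afterwards.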
20. The Reference Close delimiter (;) shall not be omitted.
This delimiter terminates an entity reference, and must always be
present. (The use of Record End as a Reference Close delimiter
is thus not permitted.)
5. Examples
This chapter gives some examples of how some of the SHAVE rules
might apply to a particular imaginary SHAVE application. This
chapter is not definitive. For further examples, see [Adie 93],
which is a complete specification of a SHAVE-conformant format.
Rules 6 and 19
If an application requires a computer parameter which takes a
list of values, it might define a DTD containing:
and a document instance might contain the following list
containing three items:
<computer>
Apple Macintosh
<> IBM PC
<> Sun Sparcstation
</computer>
Note that if an opsys element is present, then rule 19 does not
apply and every item start tag must be present:
- Apple Macintosh
- IBM PC
DOS
- Sun Sparcstation
These rules are complex, but have the advantage of maintaining
the legibility of the document instance.
Rule 7
This rule is obeyed by the following element declaration:
It is NOT obeyed by the following element declaration, which
permits the bar1 and bar2 elements to occur before the #PCDATA:
Rule 9
To define a parameter the value of which is accessed through a
URL, an application would define a notation and an element thus:
The foo element in the document instance would then contain the
URL:
ftp://ftp.ed.ac.uk/pub/mmaccess/mmaccess.ps.Z
Rule 10
Suppose a fargle element is defined as follows:
Then a document containing a fargle element containing the string
Ä&ò in a character set called Code Page 437 might contain the
following:
*
where * stands for the following sequence of octets:
8/14 2/6 6/1 6/13 7/0 3/11 9/5
Note that the occurrence of octet 2/6 in the selected character
set is replaced by the octet sequence 2/6 6/1 6/13 7/0 3/11
(ie &amp; in the default character set). The parser would
replace this octet sequence with the single octet 2/6 before
emitting it to the application. Note that 2/6 need not
correspond to the ampersand in the selected character set
(although in this case it does). The point is that the SGML
parser cares about octet 2/6, not the octet(s) which happen to
represent the ampersand in the selected character set.
Octet 3/12 (the less-than sign in the default set) would be
treated similarly.
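The processing of this octet sequence can be traced with a short
Python sketch. It converts the column/row notation used above into
octets, resolves the entity reference at the octet level (as the
parser would), and only then interprets the result in the selected
character set. The helper name octet is ours, and the sketch assumes
Python's "cp437" codec corresponds to the character set named here as
Code Page 437.

```python
def octet(col_row: str) -> int:
    """Convert column/row notation such as "8/14" into an octet value."""
    col, row = col_row.split("/")
    return int(col) * 16 + int(row)


octets = bytes(octet(x) for x in "8/14 2/6 6/1 6/13 7/0 3/11 9/5".split())

# The parser works on octets: the five octets spelling the entity
# reference are replaced by the single octet 2/6.
resolved = octets.replace(b"&amp;", b"&")

# Only the application interprets the octets in the selected set.
text = resolved.decode("cp437")
print(octets, resolved, text)
```

Doing the replacement before decoding mirrors the point made above:
the parser looks for octet 2/6 regardless of which character it
denotes in the selected character set.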
Rule 17
This rule means that in the following document fragment:
John Smith
Leader of the Opposition
the data may be lined up vertically without the leading spaces or
tabs or the trailing newlines being treated as significant data.
However, in the fragment:
Weave a circle round him thrice
And close your eyes with holy dread
the newline and intervening spaces between the two phrases are
preserved by the parser.
Rule 18
This rule implies that given the DTD fragment:
a document should be encoded:
foo@bar.com
and NOT as:
foo@bar.com
6. References
[Adie 93] SGML-based Personal Contact Information (SPCI), C.Adie,
September 1993. Internet Draft (work in progress).
[Crocker 93a] Structured Text Interchange Format (STIF), D.
Crocker, June 1993. Internet Draft (work in progress).
[Crocker 93b] Encoding for Personal Contact Information (PCI),
D. Crocker, June 1993. Internet Draft (work in
progress).
[Goldfarb 90] The SGML Handbook, C. Goldfarb, Oxford University
Press 1990 (ISBN 0-19-853737-9).
[SoftQuad 91] The SGML Primer, SoftQuad Inc 1991. Available for
$10 from:
SoftQuad Inc
+1 416 239 4801
56 Aberfoyle Crescent, Suite 810
Toronto
Canada
M8X 2W4
7. Security Considerations
There are no security implications of this specification.
8. Acknowledgements
This work was inspired by Dave Crocker's work on STIF [Crocker
93a] and PCI [Crocker 93b]. Lou Burnard provided helpful
comments.
If you provide constructive comments, you could find your name
appearing here.
9. Contact
Chris Adie
Computing Service
Edinburgh University
C.J.Adie@edinburgh.ac.uk
+44 31 650 3363
+44 31 662 4809
University Library Building
George Square
Edinburgh
EH8 9LJ
Great Britain