Comparison of the Network Time Protocol and Digital Time Service

Editor's Note

This document includes transcripts of an exchange of messages between Dave Mills of UDel, Dennis Ferguson of UToronto, Joe Comuzzi of DEC and Mike Soha of DEC. The issue under discussion is a comparison and evaluation of the Network Time Protocol (NTP) and the Digital Time Service (DTS). It is important to point out that these messages are informal, sometimes opinionated and may contain errors of judgement and technical detail. Some points of confusion and misstatement in the initial exchanges are clarified as the discussion moves on. The messages have been lightly edited to remove irrelevant asides and repetitive material, as well as to unify format style. This document is provided for informal, collaborative use in research only and should not be quoted or cited in a professional publication.

David L. Mills
Electrical Engineering Department
University of Delaware
1 September 1990

------------------------------------------------------------------------

Draft document distributed to the NTP engineering group on 12 February 1990:

A Comparison of the Network Time Protocol and Digital Time Service

David L. Mills
Electrical Engineering Department
University of Delaware

1. Introduction

The Digital Time Service (DTS) for the Digital Network Architecture (DECnet) is intended to synchronize time in computer networks ranging in size from local to wide-area. As such it is intended to provide service comparable to the Network Time Protocol (NTP) for the Internet architecture. This memorandum compares the architectures, functions and design issues of NTP and DTS with respect to correctness, stability, accuracy and reliability. It is based on information available in the various RFCs [MIL89], [MIL90b], journal articles [MIL90a], [MIL90b] and other sources [MIL90c] for NTP and on the DTS functional specification [DIG89].

In this memorandum the stability of a clock is how well it can maintain a constant frequency, the accuracy is how well its frequency and time compare with national standards and the precision is how precisely these quantities can be resolved within a particular timekeeping system. The offset of two clocks is the time difference between them, while the skew is the frequency difference (first derivative of offset with time). Real clocks exhibit some variation in skew (second derivative of offset with time), which is called drift. The correctness of a timekeeping system is the degree to which it indicates valid UTC according to some criteria, while its reliability is the fraction of the time it can be kept operating and connected in the network.

Local clocks are maintained at designated time servers, which are timekeeping systems belonging to a synchronization subnet in which each server measures the offsets between its local clock and the clocks of other servers in the subnet. In this memorandum to synchronize frequency means to adjust the clocks in the subnet to run at the same frequency, to synchronize time means to set them to agree at a particular epoch with respect to Coordinated Universal Time (UTC), as provided by national standards, and to synchronize clocks means to synchronize them in both frequency and time. The goal of a distributed timekeeping service such as NTP and DTS is to synchronize the clocks in all participating servers and clients so that all are correct, indicate the same time relative to UTC, and maintain specified measures of stability, accuracy and reliability.
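In symbols (a restatement of the definitions above, with notation introduced here for clarity), if T_A(t) and T_B(t) denote the readings of clocks A and B at time t, then

    offset:  \theta(t) = T_B(t) - T_A(t)
    skew:    \dot{\theta}(t) = \frac{d\theta}{dt}
    drift:   \ddot{\theta}(t) = \frac{d^2\theta}{dt^2}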
By international agreement the primary frequency reference for our civilization is the atomic oscillator. The standard second is determined as a specified number of atomic cycles, the standard day as 86,400 standard seconds and the standard (Julian) year as 365.25 standard days. In order to maintain the nominal solar year, the Gregorian calendar mandates the insertion of leap days, which in the absence of political upheaval can be determined far in advance. In order to maintain the nominal solar day, leap seconds must be inserted at times which cannot be reliably determined in advance. The basis of civil time is the UTC clock, which first ticked at 0h on 1 January 1972. Without knowledge of prior leap seconds, an event determined on the UTC timescale can appear some 15 seconds late on the Julian timescale.

Both the NTP and DTS timescales are based on the UTC timescale and are intended to run at atomic frequency with leap seconds inserted at times decreed by international agreement; however, they are each calibrated in different increments starting from different historic dates. While they both are intended to thrive in large, undisciplined network systems, they differ considerably in their statistical models, algorithms and service metrics. In following sections the similarities and differences are discussed at length, along with implications bearing on correctness, stability, accuracy and reliability.

2. Basic Principles and Functionality

Both NTP and DTS are designed for use in proliferated computer networks with possibly many embedded local nets interconnected by routers, gateways and bridges and involving both broadcast and point-to-point transmission media. This section summarizes the architecture and service objectives of NTP and DTS in turn.

2.1. The Network Time Protocol

The NTP synchronization subnet consists of a tree-structured graph with nodes representing time servers and edges representing the transmission paths between them. The root nodes of the tree are represented by designated primary servers which synchronize to a radio broadcast or calibrated atomic clock. The remaining nodes are designated secondary servers which synchronize to other servers, both primary and secondary. The number of subnet hops between a particular server and a primary server determines the stratum level of that server. All servers, except possibly those at the leaves of the tree, have identical functionality and can operate simultaneously as clients of the next lower stratum level and servers for the next higher one.

Servers, both primary and secondary, typically run NTP with several other servers at the same or lower stratum levels; however, a selection algorithm attempts to select the most accurate and reliable server or set of servers from which to actually synchronize the local clock. The selection algorithm, described in more detail later in this document, uses a maximum-likelihood clustering algorithm to determine the best from among a number of possible servers. The synchronization subnet itself is automatically constructed from among the available paths using the distributed Bellman-Ford routing algorithm [BER87], in which the distance metric is modified hop count. NTP operates in various modes in order to improve efficiency on local wires with many clients. These support operation in conventional RPC client/server modes, as well as symmetric and multicast modes.
The symmetric modes provide a flexible backup function in which the direction of time synchronization between a pair of servers can reverse due to loss of reachability or quality of service along one path or another in the synchronization subnet. The multicast mode is designed to provide time to personal workstations where the full accuracy of the other modes is not required.

The NTP specification includes no architected procedures for servers to obtain addresses of other servers other than by configuration files and public bulletin boards. While servers passively respond to requests from other servers, they must be configured in order to actively probe other servers. Servers configured as active poll other servers continuously, while servers configured as passive exchange messages only when polled by another server. There are no provisions in the present protocol to dynamically activate some servers should other servers fail.

In response to stated needs for security features, NTP includes an optional cryptographic authentication mechanism. NTP also includes an optional comprehensive remote monitoring mechanism found necessary for the detection and repair of various problems in server and network configuration and operation. It is anticipated that, when generic features capable of these functions have been developed and deployed in the Internet, the NTP authentication and monitoring mechanisms may be withdrawn.

2.2. The Digital Time Service

In DTS a synchronization subnet consists of a structured graph with nodes consisting of clerks, servers, couriers and time providers. With respect to the NTP nomenclature, a time provider is a primary server, a courier is a secondary server intended to import time from one or more distant primary servers for local redistribution and a server is intended to provide time for possibly many end nodes or clerks. Time providers, servers and couriers are evidently generic, in that all perform similar functions and have similar or identical internal structure. The intent is that time providers can be set from radios, telephone calls to NIST [NBS88] or even manually. As in NTP, DTS clients and servers periodically request the time from other servers, although the subnet has only a limited ability to reconfigure in the event of failure. The selection algorithm used in DTS, which is based on the work presented in Marzullo's dissertation and reported in [MAR85], will be discussed in detail later in this document.

On local nets DTS servers multicast to each other in order to construct lists of servers available on the local wire. Clerks multicast requests for these lists, which are returned in monocast mode similar to ARP in the Internet. Couriers consult the network directory system to find global time providers. For local-net operation more than one server can be configured to operate as a courier, but only one will actually operate as a courier at a time. There does not appear to be a multicast function in which a personal workstation could obtain time simply by listening on the local wire without first obtaining a list of local servers.

In the DTS model the directory, authentication and management functions are provided by other layers, entities and protocols in the DECnet architecture. As evident from other documents in the DECnet specification suite, these functions are highly developed and integrated in the architecture and presumably provide functionality equivalent to that of the NTP authentication and monitoring mechanisms.
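As an aside on the stratum mechanism of Section 2.1, the following minimal sketch (Python; the names and the ceiling value are invented for illustration and are not taken from the NTP specification) shows how a server might derive its stratum from its reachable peers, in the Bellman-Ford spirit where the metric is hop count from a primary server:

    # Sketch only: stratum assignment in the Bellman-Ford spirit of
    # Section 2.1. A primary server (radio or atomic clock) is stratum 1;
    # any other server sits one hop above the best peer it can reach.
    # The ceiling is an invented stand-in for "unsynchronized".

    def own_stratum(peer_strata, has_reference_clock=False, ceiling=15):
        if has_reference_clock:
            return 1
        reachable = [s for s in peer_strata if s < ceiling]
        return min(reachable) + 1 if reachable else ceiling

    # A server reaching peers at strata 2, 3 and 3 runs at stratum 3;
    # if all peers are lost, it sits at the ceiling until one returns.
    print(own_stratum([2, 3, 3]))   # -> 3
    print(own_stratum([]))          # -> 15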
3. Statistical Models and Data Representation

Perhaps the widest departure between the NTP and DTS philosophies is the basic underlying statistical model. NTP is based on maximum-likelihood principles and statistical mechanics, where errors are expressed in terms of expectations. DTS is based on provable assertions about the correctness of a set of mutually suspicious clocks, where errors are expressed as a set of computable bounds on maximum time and frequency offsets. This section explores these models and how they affect the quality of service.

3.1. Statistical Models

The conventional analytical model for real synchronized clocks [ALL74], [MIT80] consists of a set of oscillators connected by transmission paths. The oscillators are characterized by a set of random variables that describe their intrinsic time and frequency offsets relative to a reference timescale. In this analysis the time and frequency of an oscillator cannot be known exactly and must be described using random variables with assumed or derived probability density functions. It is possible to quantify absolute upper and lower limits of accuracy only with respect to these functions. In conventional analysis the transmission paths between the clocks are modelled as stochastic delays, with assumed distributions, usually of exponential type. These paths are used to exchange timing information and adjust the time and frequency offsets of each oscillator. The behavior of individual clocks, both in accuracy and stability, can then be characterized using standard engineering tools, such as the theory of phase-locked loops familiar to communications engineers [LIN80], [SMI86].

In the model used by DTS and several others in the literature (see [MIL90b]) a number of assumptions are made, explicitly or implicitly, about the shape of the probability density functions describing the inherent accuracy and stability of clocks, as well as the delays on the transmission paths connecting them. DTS assumes the reading error (offset relative to UTC) of a clock has a computable bound. In addition, DTS assumes the frequency error (called drift in the DTS functional specification) is bounded by an implementation constant. A correct clock never strays outside these bounds, which are computed from the inherent characteristics of the clock, the inherited characteristics of its selected synchronization source, the measured propagation delay and the accumulated error since last updated. If it is assumed that real clocks and transmission paths can be modelled reliably in this fashion, then the DTS algorithm can maintain a system of correct clocks.

The philosophy inherent in the NTP algorithms is to consider all possible information available in the stochastic model and measured statistics to arrive at a probabilistic conclusion. The time delivered by an NTP subnet is intended to be the most likely measurement in a probabilistic system in which all measurements have been weighted toward the most likely outcome. However, there is no guarantee that all clocks in the system are valid in the sense of provable correctness. No attempt is made to determine correctness other than on a statistical basis. On the other hand, DTS starts with a set of assumptions on the bounds of error and growth of error bounds with time. The clock selection algorithm is based on an intersection operation which preserves correctness according to these assumptions.
As long as the underlying probability distributions can be bounded absolutely, DTS will deliver provably correct time, if at all. But real distributions seldom behave this way, especially in the Internet, where the distributions can have surprisingly long tails. Thus, there will almost always exist a tail in the distribution which can be truncated only at the expense of some error. In other words, the DTS assumptions must be considered valid only in the context of the probability distributions actually observed. In summary, NTP attempts to deliver the best estimate of time and frequency with presumably the lowest estimated error, but cannot guarantee the correctness of the indication relative to an arbitrary set of rigid assumptions. DTS attempts to deliver correct time according to stated assumptions, but correctness can be guaranteed only with respect to these assumptions and these assumptions can be guaranteed only on a probabilistic basis.

3.2. Maximum Likelihood Estimation

It is possible to obtain various measures of expected error when processing timekeeping data and use these measures to establish preference in the various estimation algorithms. Called maximum-likelihood estimation, these techniques are widely used in signal processing and communication systems. For instance, experiments show [MIL90b] that, in a list of delay/offset measurements ordered by roundtrip delay, the most accurate offsets are usually associated with the lower delays. When selecting the best offset sample from a single clock or when selecting the best set of clocks in an ensemble, NTP gives the lower-delay samples greater weight in the filtering, selection and combining procedures.

Now consider the DTS selection algorithm, which is due to Marzullo [MAR85]. The following diagram shows two scenarios (1) and (2) involving three clocks A, B and C. Each of the dashed lines represents the interval of time offsets considered correct for that clock. As suggested in [MAR85], the probability of a particular clock reading can be assumed independent and uniformly distributed over the entire interval.

    A      +--------------------------+        A      +-+
    B      +--------------------------+        B      +-+
    C                  +-+                     C      +-+
    Result             +-+                     Result +-+
                  (1)                                 (2)

According to the algorithm, the outcome is determined in both cases by the intersection of the three intervals. However, once the intersection has been formed, all probabilistic information of antecedent distributions is lost. Put another way, the probability of the joint event consisting of the intersection of all three intervals is far lower in (1) than in (2). In NTP this information has proved highly useful in mitigating clock selection; however, the information is lost in DTS.

3.3. Representation of Timestamp Data

Both NTP and DTS exist to provide timestamps to some specified accuracy and precision. NTP represents time as a 64-bit quantity in seconds and fractions, with 32 bits as integral seconds and 32 bits as the fraction of a second. This provides resolution to about 200 picoseconds and rollover ambiguity of about 136 years. The origin of the timescale and its correspondence with UTC, atomic time and Julian days is documented in [MIL90c]. DTS represents time to a precision of 100 nanoseconds, although there appears to be no specified maximum value. The origin of the present 136-year NTP time cycle is specified as the first instant of the tropical year that began this century, which is an astronomically verifiable epoch.
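To make the two representations concrete, the following sketch (Python; the helper names are invented, and the epochs are as described in the surrounding text: 0h on 1 January 1900 for the current NTP cycle and 15 October 1582 for DTS, as discussed below) converts a calendar date to each wire format, ignoring leap seconds:

    # Sketch only: the two timestamp formats described above, under the
    # stated epoch assumptions. Leap seconds are ignored for simplicity.
    from datetime import datetime, timezone

    NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)
    DTS_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

    def to_ntp(t):
        """64-bit NTP timestamp: 32 bits of seconds since 1900, 32 of fraction."""
        delta = t - NTP_EPOCH
        seconds = (delta.days * 86400 + delta.seconds) % 2**32   # wraps after ~136 years
        fraction = delta.microseconds * 2**32 // 10**6           # ~200 ps resolution
        return (seconds << 32) | fraction

    def to_dts(t):
        """Signed 64-bit DTS timestamp: 100-ns units since 15 Oct 1582."""
        delta = t - DTS_EPOCH
        return (delta.days * 86400 + delta.seconds) * 10**7 + delta.microseconds * 10

    t = datetime(1990, 2, 12, tzinfo=timezone.utc)
    print(hex(to_ntp(t)), to_dts(t))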
The origin of the DTS timescale appears to be the implementation of the papal bull establishing the Gregorian calendar in 1582, although this instant is verifiable only by historic record. However, UTC did not exist prior to 1972 and the Gregorian calendar did not achieve widespread use until the early years of the twentieth century. While not specified, presumably DTS reckons the years, leap years and Julian days of the conventional calendar as described in the NTP specification [MIL89] and further elaborated in [MIL90c]. In retrospect, it might have been better if both NTP and DTS had adopted Modified Julian Day (MJD) numbering directly and avoided tropical centuries and papal bulls altogether.

With respect to applications involving precision time data, such as national standards laboratories, resolutions finer than the 100 nanoseconds provided by DTS are required. Present timekeeping systems for space science and navigation can maintain time to better than 30 nanoseconds, while range data over interplanetary distances can be determined to less than a nanosecond. While an ordinary application running on an ordinary computer could not reasonably be expected to require or render precise timestamps anywhere near the 200-picosecond limit of an NTP timestamp, there are many applications where a precision timestamp could be rendered by some other means and propagated via a computer and network to some other place for processing. One such application could well be synchronizing navigation systems like LORAN-C, where the timestamps would be obtained directly from station timekeeping equipment.

3.4. Time Zones and Leap Seconds

NTP specifically and intentionally has no provisions anywhere in the protocol to specify time zones or zone names. The service is designed to deliver UTC seconds and Julian days without respect to geographic position, political boundary or local custom. Conversion of NTP timestamp data to system format is expected to occur at the presentation layer; however, provisions are made to supply leap-second information to the presentation layer so that network time in the vicinity of leap seconds can be properly coordinated. DTS includes provision for time zones and presumably summer/winter adjustments in the form of a numerical time offset from UTC and arbitrary character-string label; however, it is not obvious how to distribute and activate this information in a coordinated manner.

NTP and DTS differ somewhat in the treatment of leap seconds. In DTS the normal growth in error bounds in the absence of corrections will eventually cause the bounds to include the new timescale and adjust gradually as in normal operation. Recognizing that this can take a long time, DTS includes special provisions that expand the error bounds at such times that leap seconds are expected to occur, which can shorten the period for convergence significantly. However, until the correction is determined and during the convergence interval the accuracy of the local clock with respect to other network clocks may be considerably degraded. The accuracy and stability expectations of NTP preclude this approach. In NTP the incidence of leap seconds is assumed available in advance at all primary servers and distributed automatically throughout the remainder of the synchronization subnet as part of normal protocol operations. Thus, every server and client in the subnet is aware at the instant the leap second is to take effect, and steps the local clock simultaneously with all other servers in the subnet.
Thus, the local clock accuracy and stability are preserved before, during and after the leap insertion.

3.5. Determining Time Offset and Roundtrip Delay

At first glance it may appear that NTP and DTS have quite different models to determine delay, offset and error budgets. Both involve the exchange of messages between two servers (or a client and a server). Both attempt to measure not only the clock offsets, but the roundtrip delay and, in addition, attempt to estimate the error. The diagrams below, in which time flows downward, illustrate a typical message exchange in each protocol between servers A and B.

        A           B                 A           B
        |           |                 |           |
     t1 |---------->| t2           t1 |---------->|--- t4
        |           |                 |           | |
        |           |                 |           | | w
        |           |                 |           | |
     t4 |<----------| t3           t8 |<----------|---
        |           |                 |           |
             NTP                           DTS

In NTP the roundtrip delay d and clock offset c of server B relative to A are

    d = (t4 - t1) - (t3 - t2)
    c = ((t2 - t1) + (t3 - t4)) / 2 .

This method amounts to a continuously sampled, returnable-time system, which is used in some digital telephone networks [LIN80]. Among its advantages are that both server A and server B can simultaneously calculate the delay and offset knowing only the latest time of arrival and the three preceding timestamps, which in NTP are carried with the message and can also be used for authentication purposes. The order and timing of the messages are unimportant and reliable delivery is not required. In DTS server A remembers timestamp t1 (other numbered events shown in the DTS functional specification are not shown in the diagram) and expects server B to return t4, the time of arrival of the request from server A, and w, the time elapsed until the departure of the response to server A. In principle, although NTP is symmetric and DTS is not, the two schemes are computationally equivalent and either can compute delay and offset using similar formulas.

Both NTP and DTS have to do a little dance in order to account for timing errors due to the precisions of the local clocks and the frequency offsets (usually minor) over the transaction interval itself. A purist might argue that the expressions given above for delay and offset are not strictly accurate unless the probability density functions for the path delays are known, properly convolved and expectations computed, but this applies to both NTP and DTS. The point should be made, however, that correct functioning of DTS requires reliable bounds on measured roundtrip delay, as this enters into the error budget used to construct intervals over which a clock can be considered correct. This is not nearly as important in NTP, since the accuracy and stability of the local clock is largely due to the local clock model, which is described later in this document.

4. Processing Algorithms

At the heart of any time synchronization system are the algorithms which process the data received from possibly many servers, filter out noise in the form of outliers and select the best from among a population of mutually suspicious clocks. Issues of the NTP and DTS data filtering, clock selection and combining algorithms are compared and discussed in following subsections.

4.1. Data Filtering

In both the NTP and DTS models a number of offsets are collected from each of possibly many servers. In principle, the accuracy and precision of measurements made between any pair of servers can be improved by selecting or combining a number of sequential samples in various ways.
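As a concrete rendering of the exchange in Section 3.5 (a sketch only, with invented names; timestamps are taken as floating-point seconds rather than either wire format), each completed round trip yields one delay/offset sample of the kind the filtering procedures described next operate on:

    # Sketch only: one NTP-style round trip reduced to a delay/offset
    # sample, per the formulas in Section 3.5.

    def sample(t1, t2, t3, t4):
        """t1: request sent (A), t2: request received (B),
        t3: reply sent (B), t4: reply received (A)."""
        d = (t4 - t1) - (t3 - t2)        # roundtrip delay on the paths alone
        c = ((t2 - t1) + (t3 - t4)) / 2  # offset of B relative to A
        return d, c

    # A 40-ms round trip with B running 5 ms ahead of A:
    print(sample(0.000, 0.025, 0.026, 0.041))  # -> (0.04, 0.005)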
In NTP a comprehensive program of analysis and experiment lasting several years and using many Internet transmission paths involving local nets and wide-area nets resulted in what is called the minimum-filter algorithm [MIL90b]. Reduced to essentials, this algorithm selects from among the last n samples of delay/offset collected from a single server the sample with minimum delay and presents the associated offset as the time offset estimate for that server. This is done separately for each server on a continuous basis at intervals from about one minute to about 17 minutes. As part of this continuing process, an error estimate called the sample dispersion is constructed as the sum of weighted differences of the resulting estimate offset relative to the other n-1 samples considered in the selection.

In DTS the same algorithm used to select a cohort set of servers from a population possibly including faulty servers (see next section) is used to filter the samples from a single server. This approach was discarded early in the NTP design experience for two reasons. First, the statistical problem of selecting good samples from a sequence produced by a single server has the very considerable advantage that the underlying probability distribution can be assumed stationary and represented by robust statistics such as produced by nonlinear trimmed-mean filters, median filters and the NTP minimum filter. Second, the problem of selecting good clocks from bad involves a multivariate statistical model characteristic of pattern analysis and classification. It has been the NTP experience that algorithms that work well on one of these two problems usually do not do well on the other.

4.2. Clock Selection and Combining

In both NTP and DTS the problem persists that there is often no clear distinction between truechimers and falsetickers, so that "correct" clocks can be deduced only on a probabilistic basis and then only according to arbitrary criteria. It could be argued on the basis of experience, for example, that various kinds of faulty behavior are more likely than others. For instance, it is probable that a faulty clock more likely indicates hot ones, cold zeros or an integral number of seconds, minutes, days or years in error, rather than fractional parts of these quantities. A clock that comes up with no prior hint of correct time has a vanishing probability of coming up anywhere near UTC by simple nature of the measurement space. In NTP, for example, this would amount to guessing the correct 256-ms window in an interval of 136 years. Interesting observations on these points, including the use of an NTP timestamp as a cryptographic one-time pad, can be found in the references.

NTP maintains for each server both the total estimated roundtrip delay to the root of the synchronization subnet (synchronizing distance) and the sum of the total dispersion to the root of the synchronization subnet (synchronizing dispersion). These quantities are included in the message exchanges and form the basis of the likelihood calculations. Since they always increase from the root, they can be used to calculate accuracy and reliability estimates, as well as to manage the subnet topology to reduce errors and resist destructive timing loops. In NTP the selection algorithm determines one or a number of synchronization candidates based on empirical rules and maximum-likelihood techniques.
A combining algorithm determines the local-clock adjustment using a weighted-average procedure in which the weights are determined by offset sample dispersion. The algorithm begins by constructing a list of candidate clocks sorted first by stratum and then by total synchronization dispersion to the root. The list is then pruned from the end to a manageable size and to eliminate very noisy and probably defective clocks. On the assumption that a valid synchronization candidate will always be at the lowest or next from lowest stratum, the list is truncated at the first entry where the number of different strata on the list exceeds two. This particular procedure and choice of parameters have been found to produce reliable synchronization candidates over a wide range of system environments while minimizing the "pulling" effect of high-stratum, high-dispersion servers, especially when a large number of servers are involved.

The next step is designed to detect falsetickers or other conditions which might result in gross errors. The pruned and truncated candidate list is re-sorted first by stratum and then by total synchronizing distance to the root; that is, in order of decreasing likelihood. A similar procedure is also used in Marzullo's MM algorithm [MAR85]. Next, each entry is inspected in turn and a weighted error estimate computed relative to the remaining entries on the list. The entry with maximum estimated error is discarded and the process repeats. The procedure terminates when the estimated error of each entry remaining on the list is less than a quantity depending on the intrinsic precisions of the local clocks involved.

The NTP selection algorithm is designed to favor those servers near the head (maximum likelihood) of the candidate list, which are at the lowest stratum and lowest delay and presumably can provide the most accurate time. With proper selection of weighting factors, outliers are discarded from the tail of the list, unless some other entry disagrees significantly with respect to the remaining entries, in which case that entry is discarded first. The offsets of the surviving servers are statistically equivalent, so any of them can be chosen to adjust the local clock. Some implementations [MIL90c] combine them using a weighted-average algorithm similar to that used by national standards laboratories [ALL74], in which the offsets of the servers remaining on the list are weighted by sample dispersion to produce a combined estimate.

DTS uses a rather different technique, where the goal emphasized is validated correctness relative to a set of specified criteria. A compact way of expressing this algorithm is the following. Each clock is expressed as an estimate C and an inherent or inherited error E, which defines an interval [C-E,C+E]. A clock is correct if the interval includes UTC and incorrect if not; however, it is not known in advance which is the case. Consider M such intervals, of which as many as f may represent faulty servers, and arrange the lower endpoints on a list by increasing endpoint value. Starting from the beginning of the list, find the first point which is contained in at least M-f intervals. This defines the lower boundary of the correct interval. If no such point is found, increase f by one and try again. A similar procedure is used for the upper limit of the correct interval.
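The procedure just described can be sketched as follows (Python, invented names; a simplified reading of the algorithm as stated above, not the DTS functional specification itself):

    # Sketch only: the interval-intersection procedure described above,
    # in simplified form. Each clock contributes an interval
    # (C - E, C + E); scan the endpoints for the first point lying in
    # at least M - f intervals, trying successively larger f.

    def correct_interval(intervals):
        """Return ((lo, hi), f): the correct interval and the number of
        clocks that had to be assumed faulty to obtain one."""
        M = len(intervals)
        def depth(p):
            return sum(lo <= p <= hi for lo, hi in intervals)
        for f in range(M):
            lows = [p for p in sorted(lo for lo, _ in intervals)
                    if depth(p) >= M - f]            # candidate lower bounds
            highs = [p for p in sorted((hi for _, hi in intervals), reverse=True)
                     if depth(p) >= M - f]           # candidate upper bounds
            if lows and highs:
                return (lows[0], highs[0]), f
        return None, M

    # Scenario (1) of Section 3.2: two wide clocks and one narrow one.
    print(correct_interval([(0, 26), (0, 26), (14, 16)]))  # -> ((14, 16), 0)
    # Scenario discussed next: the endpoints agree pairwise, the middle does not.
    print(correct_interval([(0, 26), (0, 5), (21, 26)]))   # -> ((0, 26), 1)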
The error E used to construct the above intervals is determined by the intrinsic characteristics of the clock oscillator (precision), the reading delay between the client request and server response and the frequency offset over the interval since the oscillator was last adjusted. Once established by the above algorithm, the correct interval grows with time, possibly engulfing intervals previously considered faulty. The interval between client requests is carefully computed to prevent the correct interval from exceeding a configuration parameter.

The fundamental assumption upon which DTS is founded is Marzullo's proof that a set of M clocks synchronized by the above algorithm, where no more than f clocks are faulty, delivers an interval including UTC. The algorithm is simple, both to express and to implement, and involves only one sorting step instead of two as in NTP. However, consider the following scenario with M = 3, f = 1 and containing three intervals A, B and C:

    A      +--------------------------+
    B      +----+
    C                            +----+
    Result +-----================-----+

Using the algorithm described in the DTS functional specification, both the lower and upper endpoints of interval A are in M-f = 2 intervals, thus the resulting interval is coincident with A. However, there remains the interval marked "=" which contains points not contained in at least M-f = 2 intervals. The DTS document mentions this interesting fact, but makes a quite reasonable choice to avoid multiple intervals in favor of a single one, even if that does in principle violate the correctness assumptions. The purist would say that a choice has to be made, either the left intersection or the right one, perhaps mitigated by maximum-likelihood principles. This example would seem to violate the fundamental basis on which the proof of correctness of Marzullo's algorithm is based.

In fact, quite similar algorithms were once used in predecessors of NTP [MIL85a], [MIL85b], but discarded because they produced inadequate accuracy and stability. One of the problems with algorithms such as this is that normal variations in network path delay cause frequent occasions when one clock or another pops in or out of one correction interval or another, causing the interval to change size and resulting in large phase noise of the local clock. While the phenomenon is much reduced in the present NTP design, some "clockhopping" does occur and is the primary contributor in NTP to clock instability. Another reason these algorithms were abandoned was that incidental error estimates, such as the size of the correct interval, cannot be used either for likelihood estimation or to organize the subnet topology.

5. Local Clocks

There are fundamental differences between the NTP and DTS local-clock models. The DTS (and Unix) model assumes the local clock runs at a rate R determined by its integral quartz resonator, which may be manufactured to a tolerance no better than 100 ppm, corresponding to an error of several seconds per day. A correction is introduced as an offset, which causes a number of fractional seconds to be added or subtracted from the local clock. In order to avoid large discontinuities and ensure monotonicity, the rate at which the clock can be adjusted is fixed by an implementation constant e (called tickadj in the Unix kernel). The local clock thus runs at three rates: R, R+e and R-e, so that the magnitude of correction determines not the magnitude of the rate, but the length of time over which the rate is continued.
Once the correction has been completed, the clock reverts to rate R and continues indefinitely at that rate. From the DTS functional specification it appears that the designers expect corrections to be recomputed on the order of once every 15 minutes to once per hour.

Early in the evolution of NTP the above model was discarded as the result of experience with sometimes broken servers and always noisy transmission paths. The factor missing in the DTS design is a capability to compensate for frequency errors as well as time errors. The NTP local-clock model includes provisions to estimate the frequency error and automatically adjust the local clock by introducing additional offset corrections on a regular basis. This results in much reduced frequency errors in the order of .01 ppm, or about a millisecond per day, in the absence of external corrections. In principle and depending on the inherent stability of the local clock, the interval between corrections can be increased to the order of hours and even days. However, if some or all of the servers in the synchronization subnet are to incorporate frequency management, the clock adjustment dynamics must be controlled and held to specified tolerances; otherwise, some servers can become unstable and experience wild time and frequency gyrations. In control theory the DTS design is described as a type-I feedback loop, which is unconditionally stable, while the NTP design is described as a type-II loop, which can become unstable under some conditions. This requires specification of the adjustment rate, including the value of e for Unix-style clocks, as well as a specified mechanism to adjust the frequency as required. While this has proved a surmountable problem with NTP daemons for Unix, it has been suggested that appropriately specified functionality be incorporated directly into the design of the kernel timekeeping facility, in which case DTS and possibly other schemes could benefit as well.

For the most accurate and precise time and frequency using ordinary hardware components it is necessary to fine-tune the adjustment dynamics to match expected local clock jitter, wander and drift. The NTP model incorporates this functionality using a drift estimate (kurtosis) which dynamically adjusts the loop bandwidth. It is arguable whether diligent pursuit of the highest quality service always justifies the additional complexity, but it is certainly necessary if accuracies in the order of a few milliseconds and stabilities in the order of a few milliseconds per day are required, especially for the primary servers. Since stability of the subnet itself is not critically dependent on this feature, it can be considered optional in the specification and implementation.

In point of fact, the local clock model described in the NTP specification is listed as optional in the same spirit as the model described in the DTS functional description. As such, the local clock can in principle be considered implementation specific and not part of the formal specification. However, as demonstrated above, frequency compensation requires the local clock adjustment to be carefully specified and implemented. The NTP mechanism has been carefully analyzed, simulated, implemented and deployed in the Internet, but DTS has not. The unavoidable conclusion is that NTP and DTS implementations cannot safely interoperate in subnets of any size, unless the DTS local clock adjustment mechanism is suitably modified.
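The distinction between the two disciplines can be illustrated with a toy discrete-time simulation (Python; the gains, step and frequency error are invented parameters, not constants from either specification). A type-I loop corrects phase only, so a constant frequency error leaves a persistent residual offset; a type-II loop also integrates the residual into a frequency estimate, driving the offset toward zero:

    # Toy sketch: phase-only (type I) versus phase-plus-frequency
    # (type II) clock discipline, with a constant frequency error.

    FREQ_ERR = 50e-6     # intrinsic frequency error: 50 ppm
    STEP = 64.0          # seconds between corrections
    KP, KF = 0.5, 0.01   # phase gain; frequency-integration gain (invented)

    def run(type2, updates=2000):
        offset, freq_est = 0.0, 0.0
        for _ in range(updates):
            offset += (FREQ_ERR - freq_est) * STEP   # uncompensated wander
            offset -= KP * offset                    # phase correction (both types)
            if type2:
                freq_est += KF * offset / STEP       # type II: learn the frequency
        return offset

    print(f"type I  residual: {run(False) * 1e3:.3f} ms")  # settles near 3.2 ms
    print(f"type II residual: {run(True) * 1e3:.3f} ms")   # decays toward zero

The sketch also hints at why the dynamics must be specified: with larger gains or steps the type-II recursion can oscillate or diverge, which is the instability risk noted above.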
5.2. Monotonicity

It is an uncontested fact that computer systems can be badly disrupted should apparent time appear to warp (jump) backwards, rather than always tick forward as our universe requires. Both NTP and DTS take explicit precautions to avoid the local clock running backwards or large warps when running forwards. However, both NTP and DTS models recognize that there are some extreme conditions in which it is better to warp backwards or forwards, rather than allow the adjustment procedure to continue for an outrageously long time. The local clock is warped if the correction exceeds an implementation constant, +-128 milliseconds for NTP and ten minutes for DTS. The large difference between the NTP and DTS values is attributed to the accuracy models assumed. For most servers and transmission paths in the Internet an offset spike over +-128 milliseconds (following filtering, selection and combining operations) is so rare as to be almost negligible. For the few exceptions operating in extreme dispersive conditions, such as statistical multiplexors or switched landline/satellite paths, the 128-ms value can be increased by a configuration parameter. The problem with selecting larger values is that the time taken to effect a spike correction can be rather long, during which the clock accuracy specification can be exceeded. Obviously, the same considerations apply in DTS.

5.3. Epoch Determination

The DTS functional specification points out an interesting requirement, common to other network management and routing protocols, for circuit breakers between a set of clocks synchronized to each other but known to be faulty and another set synchronized to each other but known to be correct. The problem is to avoid infection of the set of correct clocks by timestamps from the faulty set. DTS provides this circuit breaker in the form of an epoch number, which is incremented when a new subnet is created. Once the first member of the new subnet has been created, others can be transferred from the faulty to the correct subnet one at a time so that correctness is preserved. In NTP the circuit breaker is provided by the authentication mechanism, which can operate with any of several encryption keys. When a new subnet is created, all that is required is to change the key of a known correct server and then change the keys of other servers one at a time. Eventually all servers are running with the new key and the subnet continues as usual. The same scheme has also been used when testing new implementations and on occasion to isolate known falsetickers which cannot otherwise be partitioned from the Internet.

5.4. Dynamic Polling Intervals

In both NTP and DTS servers exchange timekeeping data at regular intervals. In NTP the polling intervals are dynamically adjusted from about one minute to about 17 minutes, depending on sample dispersion and local clock stability. In DTS the intervals are fixed, with lower bounds of two minutes (servers) and 15 minutes (clerks) and upper bounds depending on error tolerance. Relatively frequent polls are necessary both to confirm reachability (for hot standby service), as well as to maintain accuracy within specified limits (DTS) and to maintain optimum subnet stability (NTP). At the present stage of protocol refinement, NTP polling intervals can be expected to be somewhat less than DTS intervals.
The reason for this is the emphasis in NTP on the highest attainable accuracy and stability, which requires compensation for frequency errors as well as timing errors. Stability of the closed-loop system without bandwidth control presently requires a maximum polling interval in the order of one minute for those transmission paths actually used to maintain synchronization; however, the polling interval is increased typically to 17 minutes for other paths, which statistically account for more than two-thirds of the total number of paths. However, with the introduction of bandwidth control in the latest NTP implementations, the polling interval can be increased to 17 minutes on all paths, with the expectation of even larger increases at the higher stratum levels.

6. Summary and Conclusions

The service objectives of both NTP and DTS are substantially the same: to deliver correct, accurate, stable and reliable time throughout the synchronization subnet. However, as demonstrated in this document, these objectives are not all simultaneously achievable. For instance, in a system of real clocks some may be correct according to an established and trusted criterion (truechimers) and some may not (falsetickers). In the models used by NTP and DTS the distinction between these two groups is made on the basis of different clustering techniques, neither of which is statistically infallible. A succinct way of putting it might be to say that NTP attempts to deliver the most accurate, stable and reliable time according to statistical principles, while DTS attempts to deliver validated time according to correctness principles, but possibly at the expense of accuracy and stability.

In both the NTP and DTS models the problem is to determine which of possibly many clocks are truechimers and which are not. An interesting observation about both NTP and DTS is that neither attempts to assess the relative importance of misses (mislabelling a truechimer as a falseticker) relative to false alarms (mislabelling a falseticker as a truechimer). In signal detection analysis this is established by the likelihood ratio, with high ratios favoring misses over false alarms. In retrospect, it could be said that NTP assumes a somewhat lower likelihood ratio than does DTS.

It might be concluded from the discourse in this document that, if the service objective is the highest accuracy and precision, then the protocol of choice is NTP; however, if the objective is correctness, then the protocol of choice is DTS. However, the discussion in Section 4.2 casts some doubt either on this claim, the DTS functional specification or this investigator's interpretation of it. It is certainly true that DTS is "simple" and NTP is "complex," but these are relative terms and the complexity of NTP did not result from accident. That even the complexity of NTP is surmountable is demonstrated by the fact that over 2000 NTP-synchronized servers chime each other in the Internet now. The most serious departure between NTP and DTS, and the reason that subnets incorporating large numbers of either protocol cannot interoperate safely without further consideration, is the fact that in NTP it has been found necessary to implement local clock frequency compensation and in DTS it has not. Whether or not the additional rigor in specification and implementation can be justified depends on the expectation of the time-service customers and their applications.
Frequency compensation not only provides the capability to survive long server outages while keeping good local time, but is instrumental in reducing timing noise and maintaining the highest accuracy and stability. The widespread deployment of NTP in the Internet seems to confirm that distributed Internet applications can expect that reliable, synchronized time can be maintained to within about two orders of magnitude less than the overall roundtrip delay to the root of the synchronization subnet. For most places in the Internet today that means overall network time can be confidently maintained to a few tens of milliseconds [MIL90a]. While the behavior of large-scale deployment of DTS in internet environments is unknown, it is unlikely that it can provide comparable performance in its present form. With respect to the future refinement of DTS, should this be considered, it is inevitable that the same performance obstacles and implementation choices found by NTP will be found by DTS as well.

7. References

[ALL74] Allan, D.W., J.E. Gray and H.E. Machlan. The National Bureau of Standards atomic time scale: generation, stability, accuracy and accessibility. In: Blair, B.E. (Ed.). Time and Frequency Theory and Fundamentals. National Bureau of Standards Monograph 140, U.S. Department of Commerce, 1974, 205-231.

[BER87] Bertsekas, D., and R. Gallager. Data Networks. Prentice-Hall, Englewood Cliffs, NJ, 1987.

[DIG89] Digital Equipment Corporation. Digital Time Service functional specification, version T1.0.5. Digital Equipment Corporation, December 1989.

[LIN80] Lindsay, W.C., and A.V. Kantak. Network synchronization of random signals. IEEE Trans. Communications COM-28, 8 (August 1980), 1260-1266.

[MAR85] Marzullo, K., and S. Owicki. Maintaining the time in a distributed system. ACM Operating Systems Review 19, 3 (July 1985), 44-54.

[MIL85a] Mills, D.L. Algorithms for synchronizing network clocks. DARPA Network Working Group Report RFC-956, M/A-COM Linkabit, September 1985.

[MIL85b] Mills, D.L. Experiments in network clock synchronization. DARPA Network Working Group Report RFC-957, M/A-COM Linkabit, September 1985.

[MIL89] Mills, D.L. Network Time Protocol (Version 2) specification and implementation. DARPA Network Working Group Report RFC-1119, University of Delaware, September 1989.

[MIL90a] Mills, D.L. On the accuracy and stability of clocks synchronized by the Network Time Protocol in the Internet system. ACM Computer Communication Review 20, 1 (January 1990), 65-75.

[MIL90b] Mills, D.L. Internet time synchronization: the Network Time Protocol. IEEE Trans. Communications (to appear). See also: DARPA Network Working Group Report RFC-1129, University of Delaware, October 1989.

[MIL90c] Mills, D.L. The NTP local-clock model and control algorithms. Unpublished memorandum, February 1990.

[MIT80] Mitra, D. Network synchronization: analysis of a hybrid of master-slave and mutual synchronization. IEEE Trans. Communications COM-28, 8 (August 1980), 1245-1259.

[NBS88] Automated Computer Time Service (ACTS). NBS Research Material 8101, U.S. Department of Commerce, 1988.

[SMI86] Smith, J. Modern Communications Circuits. McGraw-Hill, New York, NY, 1986.

------------------------------------------------------------------------

Date: Fri, 16 Mar 90 09:58:35 PST
From: comuzzi@took.enet.dec.com
To: mills@udel.edu
Subject: RE: DTS and NTP revisited

... The Digital Time Service (DTS) for the Digital Network Architecture (DECnet) is intended to synchronize time in computer networks ranging in size from local to wide-area.
You seem to be trying to clothe DTS in a proprietary cloth. We now refer to DECnet as DECnet/OSI since we've incorporated OSI protocols into the protocol stack. It is our intention to pursue DTS in the OSI standards forums.

... As such it is intended to provide service comparable to the Network Time Protocol (NTP) for the Internet architecture.

While both are clearly addressing the same problem space, DTS and NTP have VERY different goals. I recently spoke to the president of a time provider manufacturer and I liked his jargon: he distinguished between the time-of-day market and the frequency market. The time-of-day market wants to know what time it is; it is not interested in small errors and it doesn't want to pay a lot. The frequency market wants stable frequency sources, needs high stability and is willing to pay. NTP is a solution for the frequency market. DTS is only interested in the time-of-day market. The major cost for these solutions is not the initial capital investment, but the long term management and operation cost. As such DTS has goals of auto-configurability and ease of management which are not present in NTP.

... Local clocks are maintained at designated time servers, which are timekeeping systems belonging to a synchronization subnet in which each server measures the offsets between its local clock and the clocks of other servers in the subnet. In this memorandum to synchronize frequency means to adjust the clocks in the subnet to run at the same frequency, to synchronize time means to set them to agree at a particular epoch with respect to Coordinated Universal Time (UTC), as provided by national standards, and to synchronize clocks means to synchronize them in both frequency and time. The goal of a distributed timekeeping service such as NTP and DTS is to synchronize the clocks in all participating servers and clients so that all are correct, indicate the same time relative to UTC, and maintain specified measures of stability, accuracy and reliability.

As stated above, DTS is addressing the time-of-day market, hence high frequency stability is not a goal of DTS.

... Servers, both primary and secondary, typically run NTP with several other servers at the same or lower stratum levels; however, a selection algorithm attempts to select the most accurate and reliable server or set of servers from which to actually synchronize the local clock. The selection algorithm, described in more detail later in this document, uses a maximum-likelihood clustering algorithm to determine the best from among a number of possible servers. The synchronization subnet itself is automatically constructed from among the available paths using the distributed Bellman-Ford routing algorithm [BER87], in which the distance metric is modified hop count.

Note that in DTS loops are not a problem: if a system sends out a time and ultimately gets back a derived time, due to the communication delays the derived time will always arrive back with a larger inaccuracy. The only exception to this is the possibility of a system with a time provider and a lousy clock. Then the derived time's inaccuracy could be smaller if the time was parked in a system with a good clock. But in this case the network clearly has information that the original system has lost.

... The NTP specification includes no architected procedures for servers to obtain addresses of other servers other than by configuration files and public bulletin boards.

This is a serious shortcoming of NTP and definitely makes it harder to manage.
It is unclear to me why you haven't fixed this, since it would not seem that difficult to store server names in a namespace.

... While servers passively respond to requests from other servers, they must be configured in order to actively probe other servers. Servers configured as active poll other servers continuously, while servers configured as passive exchange messages only when polled by another server. There are no provisions in the present protocol to dynamically activate some servers should other servers fail.

This is harder to fix and interacts with the spanning tree. Here at least I can see why you didn't make it easier to manage. These problems make NTP a system administrator's nightmare, but are consistent with the two different sets of goals. Consistent with DTS goals we've accepted some "clock hopping" in exchange for ease of management.

... In DTS a synchronization subnet consists of a structured graph with nodes consisting of clerks, servers, couriers and time providers. With respect to the NTP nomenclature, a time provider is a primary server, a courier is a secondary server intended to import time from one or more distant primary servers for local redistribution and a server is intended to provide time for possibly many end nodes or clerks. Time providers, servers and couriers are evidently generic, in that all perform similar functions and have similar or identical internal structure.

Not only are they generic, they are dynamic. If a time provider system loses its radio signal, it immediately reverts to a server, providing graceful degradation in the presence of failures.

... The intent is that time providers can be set from radios, telephone calls to NIST [NBS88] or even manually.

The DTS story is actually even better here: we provide a well defined time provider interface. This can be used to implement a time provider without requiring modification of the protocol portions of the time service. (On Unix systems it uses Unix domain sockets.) This greatly eases adding a new time provider, and permits time provider vendors to supply it with their hardware. Note that NTP could (and probably should) do this also. We have already done it.

... As in NTP, DTS clients and servers periodically request the time from other servers, although the subnet has only a limited ability to reconfigure in the event of failure.

I don't understand this statement. Reconfiguration within a LAN is about as complete as one could imagine. The random selection of global servers is robust against any non-partitioning WAN failures.

... On local nets DTS servers multicast to each other in order to construct lists of servers available on the local wire. Clerks multicast requests for these lists, which are returned in monocast mode similar to ARP in the Internet. Couriers consult the network directory system to find global time providers. For local-net operation more than one server can be configured to operate as a courier, but only one will actually operate as a courier at a time.

This is false; I think you're failing to distinguish between couriers and backup couriers. There can be more than one courier per LAN, and each will always synchronize with at least one member of the global set. Backup couriers use an election algorithm in the absence of a courier. Only one backup courier will be elected to function as a courier.

... There does not appear to be a multicast function in which a personal workstation could obtain time simply by listening on the local wire without first obtaining a list of local servers.
That is correct; it would violate the principle that a message exchange has to happen in order to correctly assign an inaccuracy.

... Both NTP and DTS exist to provide timestamps to some specified accuracy and precision. NTP represents time as a 64-bit quantity in seconds and fractions, with 32 bits as integral seconds and 32 bits as the fraction of a second. This provides resolution to about 200 picoseconds and rollover ambiguity of about 136 years. The origin of the timescale and its correspondence with UTC, atomic time and Julian days is documented in [MIL90c]. DTS represents time to a precision of 100 nanoseconds, although there appears to be no specified maximum value.

The DTS time is a signed 64-bit count of 100-nanosecond units since Oct 15, 1582. It will not run out until after the year 30,000 AD, unlike NTP, which will run out in 2036. I, for one, intend to still be alive in 2036! There are two reasons the 100 ns was chosen:

1) We want to use these timestamps as a time representation, for filesystem timestamps, etc. We REALLY don't want to deal with the problem that our representation is inadequate at some reasonably future time. Also, since the 64 bits is signed, times back to 28,000 BC can be represented. This is potentially useful for astronomical data and, happily, includes all of recorded history. If we decreased the resolution, we would give up range. This choice seemed like a reasonable compromise.

2) Since we include the transmission delay in the inaccuracy, 100 ns represents only 30 meters. It's not meaningful to talk about synchronizing clocks below that level with our algorithm. (I believe it's not meaningful to talk about synchronizing clocks below that level with NTP either.)

The total timestamp is 128 bits; this includes a four-bit version number field which would permit these decisions to be revisited in the future.

... With respect to applications involving precision time data, such as national standards laboratories, resolutions finer than the 100 nanoseconds provided by DTS are required. Present timekeeping systems for space science and navigation can maintain time to better than 30 nanoseconds, while range data over interplanetary distances can be determined to less than a nanosecond. While an ordinary application running on an ordinary computer could not reasonably be expected to require or render precise timestamps anywhere near the 200-picosecond limit of an NTP timestamp, there are many applications where a precision timestamp could be rendered by some other means and propagated via a computer and network to some other place for processing. One such application could well be synchronizing navigation systems like LORAN-C, where the timestamps would be obtained directly from station timekeeping equipment.

There is an obvious inconsistency in your position here. If you're just using the NTP time format for synchronization, then talking about 136-year rollovers makes some sense; it could be hidden from the users by extending the protocol. If, however, as this paragraph implies, you intend the NTP time format as a general timestamp, then there will be extreme pain in the year 2036. (This is referred to in DEC as the "date75" problem!) To avoid this without unduly extending the timestamp, DTS has traded off being able to use its timestamp format for certain highly precise applications.

... NTP specifically and intentionally has no provisions anywhere in the protocol to specify time zones or zone names.
The service is designed to deliver UTC seconds and Julian days without respect to geographic position, political boundary or local custom. Conversion of NTP timestamp data to system format is expected to occur at the presentation layer; however, provisions are made to supply leap-second information to the presentation layer so that network time in the vicinity of leap seconds can be properly coordinated. DTS includes provision for time zones and presumably summer/winter adjustments in the form of a numerical time offset from UTC and an arbitrary character-string label; however, it is not obvious how to distribute and activate this information in a coordinated manner.

The information is used only as a help in user displays. That is, an application can display BOTH the UTC time and the local time at which a timestamp was created. It only cost 12 bits to do this. No use is made of the timezone information by DTS or by systems.

...

The accuracy and stability expectations of NTP preclude this approach. In NTP the incidence of leap seconds is assumed available in advance at all primary servers and distributed automatically throughout the remainder of the synchronization subnet as part of normal protocol operations. Thus, every server and client in the subnet is aware of the instant at which the leap second is to take effect, and steps the local clock simultaneously with all other servers in the subnet. Thus, the local clock accuracy and stability are preserved before, during and after the leap insertion.

Each server has to maintain and propagate this state before the leap insertion. This is, of course, subject to Byzantine failures. A failing server can insert a bad notification.

...

In NTP the roundtrip delay d and clock offset c of server B relative to A, where t1 and t4 are the times at which the request left and the reply returned to A and t2 and t3 are the times at which the request arrived at and the reply left B, are

    d = (t4 - t1) - (t3 - t2)
    c = ((t2 - t1) + (t3 - t4)) / 2.

This method amounts to a continuously sampled, returnable-time system, which is used in some digital telephone networks [LIN80].

The derivation of the expression for 'c' above assumes the two transit delays for this exchange are symmetric. If there are systematically asymmetric transmission delays then the NTP algorithm will shift the two clocks so that they appear to be synchronized, when in fact they are systematically off by some number of milliseconds. The NTP minimum filter attempts to minimize this effect, assuming that the shortest roundtrip exchange would have to be symmetric or nearly so. Unfortunately quite large systematic asymmetric delays can occur for a variety of reasons: source-routed networks, broken routing tables, etc., and these would apply to all transactions including the shortest. This problem exists in DTS also, but in DTS both of the systems will have an inaccuracy which encompasses the correct time. That is, DTS will not claim to have synchronized clocks to a level which it has not, even in the presence of asymmetric delays. NTP can and has.

...

Both NTP and DTS have to do a little dance in order to account for timing errors due to the precisions of the local clocks and the frequency offsets (usually minor) over the transaction interval itself. A purist might argue that the expressions given above for delay and offset are not strictly accurate unless the probability density functions for the path delays are known, properly convolved and expectations computed, but this applies to both NTP and DTS.
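[Ed. note] As a concrete illustration of the exchange above, here is a minimal sketch of the sample calculation in C. The four timestamps, the symmetric 20 ms one-way delay and the 5 ms offset are invented for illustration; a real implementation would use NTP's 64-bit fixed-point format rather than doubles.

    /* A sketch of the NTP sample calculation above.  t1 and t4 are
     * read from A's clock, t2 and t3 from B's clock; all values are
     * seconds.  The example numbers assume B runs 5 ms ahead of A
     * and each one-way trip takes 20 ms (the symmetric case). */
    #include <stdio.h>

    static void ntp_sample(double t1, double t2, double t3, double t4,
                           double *d, double *c)
    {
        *d = (t4 - t1) - (t3 - t2);          /* roundtrip delay      */
        *c = ((t2 - t1) + (t3 - t4)) / 2.0;  /* clock offset B - A   */
    }

    int main(void)
    {
        double d, c;

        ntp_sample(0.000, 0.025, 0.026, 0.041, &d, &c);
        printf("d = %.3f s, c = %.3f s\n", d, c);  /* d = 0.040, c = 0.005 */
        return 0;
    }

With an asymmetric path the same arithmetic silently folds half the asymmetry into c, which is the hazard discussed above.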
The point should be made, however, that correct functioning of DTS requires reliable bounds on measured roundtrip delay, as this enters into the error budget used to construct intervals over which a clock can be considered correct.

However, this is not at all hard to compute. Simply increase the inaccuracy by the potential drift of the local clock during the transaction. The architecture specifies this.

...

NTP maintains for each server both the total estimated roundtrip delay to the root of the synchronization subnet (synchronizing distance), as well as the sum of the total dispersion to the root of the synchronization subnet (synchronizing dispersion).

This synchronizing distance has a rather loose definition. I believe the current NTP RFC suggests using ten times the mean expected error for the synchronizing distance. If this parameter is important to the NTP algorithm I would expect some stronger specification. Also, where does the value ten come from? I know it's experimentally derived and seems to work ...

These quantities are included in the message exchanges and form the basis of the likelihood calculations. Since they always increase from the root, they can be used to calculate accuracy and reliability estimates, as well as to manage the subnet topology to reduce errors and resist destructive timing loops.

While you state the synchronizing distance and synchronizing dispersion can be used to calculate accuracy, I have never seen a derivation of how this could be done. This is one of the recurring points, the lack of formal proofs.

...

The next step is designed to detect falsetickers or other conditions which might result in gross errors. The pruned and truncated candidate list is re-sorted in order first by stratum and then by total synchronizing distance to the root; that is, in order of decreasing likelihood. A similar procedure is also used in Marzullo's MM algorithm [MAR85]. Next, each entry is inspected in turn and a weighted error estimate computed relative to the remaining entries on the list. The entry with maximum estimated error is discarded and the process repeats. The procedure terminates when the estimated error of each entry remaining on the list is less than a quantity depending on the intrinsic precisions of the local clocks involved.

A point which is not discussed here is that when NTP chooses to prune an entry, it cannot determine whether this entry's problem is that it comes from a bad clock (falseticker in your jargon) or that it experienced unusually large and asymmetric network delays. The latter case is something to be expected in normal operation; the former represents a problem which should be fixed. DTS uses the interval information to identify such bad clocks and reports them, since if a clock's interval doesn't intersect the majority it is clearly faulty. This is, of course, a MAJOR issue in distributed system management.

...

The fundamental assumption upon which the DTS is founded is Marzullo's proof that a set of M clocks synchronized by the above algorithm, where no more than j clocks are faulty, delivers an interval including UTC. The algorithm is simple, both to express and to implement, and involves only one sorting step instead of two as in NTP.
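[Ed. note] To make the scenario that follows easier to track, here is a minimal sketch of the interval-intersection step under one reading of the DTS specification: given M intervals of which at most j may come from faulty clocks, return the span from the smallest point contained in at least M-j intervals to the largest such point. The function names and the endpoint-sweep structure are the editor's, not DTS's; this is an illustration, not the official algorithm.

    #include <stdio.h>
    #include <stdlib.h>

    struct edge { double x; int delta; };  /* +1 at a lower endpoint, -1 at an upper */

    static int cmp(const void *a, const void *b)
    {
        const struct edge *ea = a, *eb = b;
        if (ea->x != eb->x)
            return (ea->x > eb->x) - (ea->x < eb->x);
        return eb->delta - ea->delta;      /* lower endpoints first at ties */
    }

    /* Sweep the sorted endpoints, tracking how many intervals cover the
     * current point; record the first and last points covered by at
     * least m - j intervals. */
    static int intersect(int m, int j, const double lo[], const double hi[],
                         double *rlo, double *rhi)
    {
        struct edge e[2 * m];
        int i, n = 0, count = 0, found = 0;

        for (i = 0; i < m; i++) {
            e[n].x = lo[i]; e[n++].delta = +1;
            e[n].x = hi[i]; e[n++].delta = -1;
        }
        qsort(e, n, sizeof e[0], cmp);
        for (i = 0; i < n; i++) {
            if (e[i].delta > 0) {
                if (++count >= m - j && !found) { *rlo = e[i].x; found = 1; }
            } else {
                if (count-- >= m - j) *rhi = e[i].x;
            }
        }
        return found;
    }

    int main(void)
    {
        /* The A, B, C scenario discussed below: A wide, B and C narrow,
         * disjoint, and touching A's endpoints. */
        double lo[] = { 0.0, 0.0, 8.0 }, hi[] = { 10.0, 2.0, 10.0 }, a, b;

        if (intersect(3, 1, lo, hi, &a, &b))
            printf("result: [%.1f, %.1f]\n", a, b);  /* [0.0, 10.0], i.e. A */
        return 0;
    }

With the three intervals of the scenario below, this returns exactly interval A, which is what motivates the objection and the rebuttal that follow.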
However, consider the following scenario with M = 3, j = 1 and containing three intervals A, B and C:

    A       +--------------------------+
    B       +----+
    C                             +----+
    Result  +-----================-----+

Using the algorithm described in the DTS functional specification, both the lower and upper endpoints of interval A are in M-j = 2 intervals, thus the resulting interval is coincident with A. However, there remains the interval marked "=" which contains points not contained in at least two other intervals. The DTS document mentions this interesting fact, but makes a quite reasonable choice to avoid multiple intervals in favor of a single one, even if that does in principle violate the correctness assumptions.

Come on, this in no way violates the correctness assumption. The proofs tell us that the correct time is somewhere in the two dashed sub-intervals. By making the statement that the time is somewhere in the larger interval, a server is making a WEAKER assertion. Marzullo's proof would apply and the algorithm would work (sub-optimally) if servers arbitrarily lengthened the intervals they computed.

...

In point of fact, the local clock model described in the NTP specification is listed as optional in the same spirit as the model described in the DTS functional description. As such, the local clock can in principle be considered implementation specific and not part of the formal specification.

This is a rather odd statement. What I read is that the local clock model is not explicitly required by the NTP documents, but it is, in fact, required in functioning implementations.

However, as demonstrated above, frequency compensation requires the local clock adjustment to be carefully specified and implemented. The NTP mechanism has been carefully analyzed, simulated, implemented and deployed in the Internet, but DTS has not.

I have never read a clear specification of the required quality of the input time to NTP. However, the following argument shows that in a LAN of typical machines, DTS can indeed provide time to NTP. The clock resolution of most machines is between 1 and 16.7 milliseconds. Thus, any single measurement made by NTP MUST experience this clock jitter. NTP can achieve better overall results only by averaging many such measurements. We have measured the 'jitter' of DTS times in LANs; it is less than 10 milliseconds, so if DTS supplies time to NTP in a typical LAN, NTP will receive time similar in quality to the time it gets from other NTP servers. In the WAN case, the jitter may be a problem; I assume that interoperating in the presence of WAN links may require clock training.

If you could provide the derivation of accuracy from synchronization distance and synchronization dispersion that you allude to in section 4.2, this could form the basis of reliable interoperation with NTP supplying time to DTS. Alas, I suspect such a derivation is unachievable. However, for installations which are not concerned with the DTS guarantee, the time provider interface could be used to import NTP time into DTS (just like any time provider, though there would have to be a user-supplied inaccuracy, based on local experience with NTP). We intend to include a sample time provider program to permit this.

...

It is an uncontested fact that computer systems can be badly disrupted should apparent time appear to warp (jump) backwards, rather than always tick forward as our universe requires. Both NTP and DTS take explicit precautions to avoid the local clock running backwards or large warps when running forwards.
However, both the NTP and DTS models recognize that there are some extreme conditions in which it is better to warp backwards or forwards, rather than allow the adjustment procedure to continue for an outrageously long time. The local clock is warped if the correction exceeds an implementation constant, +-128 milliseconds for NTP and ten minutes for DTS. The large difference between the NTP and DTS values is attributed to the accuracy models assumed.

I believe the difference also comes from different assumptions of the risks (and probabilistic costs) involved in jumping the clock. We assume it is something you want to do rarely.

For most servers and transmission paths in the Internet an offset spike (following filtering, selection and combining operations) over +-128 milliseconds following filtering, selection and combining operations is so rare as to be almost negligible.

The duplicated text makes me think there is something wrong here, though frankly I don't understand what this paragraph is trying to say.

...

The service objectives of both NTP and DTS are substantially the same: to deliver correct, accurate, stable and reliable time throughout the synchronization subnet. However, as demonstrated in this document, these objectives are not all simultaneously achievable. For instance, in a system of real clocks some may be correct according to an established and trusted criterion (truechimers) and some may not (falsetickers). In the models used by NTP and DTS the distinction between these two groups is made on the basis of different clustering techniques, neither of which is statistically infallible. A succinct way of putting it might be to say that NTP attempts to deliver the most accurate, stable and reliable time according to statistical principles, while DTS attempts to deliver validated time according to correctness principles, but possibly at the expense of accuracy and stability.

I would claim you're understating DTS's goals of autoconfigurability and manageability.

In both the NTP and DTS models the problem is to determine which subset of possibly many clocks represents the truechimers and which do not. An interesting observation about both NTP and DTS is that neither attempts to assess the relative importance of misses (mislabelling a truechimer as a falseticker) relative to false alarms (mislabelling a falseticker as a truechimer). In signal detection analysis this is established by the likelihood ratio, with high ratios favoring misses over false alarms. In retrospect, it could be said that NTP assumes a somewhat lower likelihood ratio than does DTS.

I'm not sure I understand your jargon here. The important trade-off for DTS is to notify managers of broken clocks (calling a falseticker a falseticker) so that they can be fixed. Declaring a good clock bad (labeling a truechimer a falseticker) could only occur in DTS as an implementation error or as a massive multi-server failure. In either case a human will have to get involved.

It might be concluded from the discourse in this document that, if the service objective is the highest accuracy and precision, then the protocol of choice is NTP; however, if the objective is correctness, then the protocol of choice is DTS. However, the discussion in Section 4.2 casts some doubt either on this claim, the DTS functional specification or this investigator's interpretation of it.

I believe you are doing your position a disservice by raising this red herring.
No one has found your argument that DTS violates the assumptions of Marzullo's thesis convincing. Lamport commented that it indicates a serious misunderstanding of Marzullo's proof.

It is certainly true that DTS is "simple" and NTP is "complex," but these are relative terms and the complexity of NTP did not result from accident. That even the complexity of NTP is surmountable is demonstrated by the fact that over 2000 NTP-synchronized servers chime each other in the Internet now.

The ever-decreasing cost of time providers argues heavily for a simple solution, even though it may require more time providers. It simply isn't worth a lot of software complexity (and maintenance cost, and management cost) to avoid spending a few dollars to buy more providers. Further, the philosophy of 'correctness' leads to certifiable implementation by independent vendors.

...

The widespread deployment of NTP in the Internet seems to confirm that distributed Internet applications can expect that reliable, synchronized time can be maintained to within about two orders of magnitude less than the overall roundtrip delay to the root of the synchronization subnet. For most places in the Internet today that means overall network time can be confidently maintained to a few tens of milliseconds [MIL90a]. While the behavior of large-scale deployment of DTS in internet environments is unknown, it is unlikely that it can provide comparable performance in its present form. With respect to the future refinement of DTS, should this be considered, it is inevitable that the same performance obstacles and implementation choices found by NTP will be found by DTS as well.

I disagree with this final paragraph. I think that NTP and DTS both attain their very different goals. Our difference of opinion is in how important the different goals are. I accept that DTS will not keep clocks quite as tightly synchronized as NTP. It will, however, be a product that a vendor can confidently ship to customers who are expected to install, configure and manage it themselves.

------------------------------------------------------------------------

Date: Mon, 19 Mar 90 14:19:44 EST
From: Dennis Ferguson
To: Mills@udel.edu, comuzzi@took.dec.com, elb@mwunix.mitre.org, marcus@osf.org
Subject: Re: Review and comment on comparison of DTS and NTP

I have avoided more than a brief comment on DTS since my copy was compressed two pages onto a page and then FAXed to a really rotten FAX machine here. It is half unreadable, and I have been a little reluctant to get involved in this for fear I might attribute to DTS shortcomings which are dealt with in the parts I can't read. I don't, however, feel my particular concerns about DTS have been adequately dealt with by what has passed so far.

As a consumer of time I care only about results, and what is required of me to achieve them. In particular I care about three things: that my computer's system clock be as accurate as possible as much of the time as possible for a reasonable expenditure of CPU cycles and network bandwidth, that I not have to work too hard to achieve this, and that I have some way of verifying the truthfulness of the time I am receiving. Note that how accurate is "accurate" is hardly quantifiable; the clock should provide the time as accurately as is possible within the constraints stemming from hardware and network deficiencies. Similarly, I would like to do no work to achieve this other than starting a program.
The "truthfulness" part is important since one of the major application groups which will require time synchronization will no doubt be authentication protocols (e.g. SNMP authentication, and Kerberos to some degree) and I don't want to leave a hole for attacking these through the time protocol. In this light I'd like to make some comments on the recent round of debate concerning DTS versus NTP. NTP is a solution for the frequency market. DTS is only interested in the time-of-day market. The major cost for these solutions is not the initial capital investment, but the long term management and operation cost. As such DTS has goals of auto-configurability and ease of management which are not present in NTP. This is blatantly misleading. NTP is more accurate than DTS because it includes computational machinery to condition the local clock, which DTS lacks. DTS includes a sub-protocol for autoconfiguring a large portion of the synchronization subnet, NTP does not (or at least not on the scale of DTS). All this is true. What is misleading is the strong implication that these issues are somehow related. They are completely and utterly orthogonal. NTP could be enhanced with an auto-configuration protocol, and indeed could use a variant of DTS' scheme, without affecting the precision it achieves one wit. Similarly, DTS' omission of the local clock machinery was in no way necessitated by its ability to auto-configure, nor any other aspect of the protocol which I can fathom. DTS just left this part out, and as a consequence is sloppier. The "time-of-day market" versus "frequency market" analogy is hence quite faulty, I think. I can see no cost at all associated with NTP's precision, perhaps other than requiring additional work on the part of the implementer of the software. As stated above, DTS is addressing the time-of-day market hence high frequency stability is an not a goal of DTS. Again, this is so silly. As far as I can see, DTS has gained exactly nothing by leaving out the local clock conditioning code, but has lost an order of magnitude or more in accuracy under normal conditions and several orders of magnitude should the subnet partition and cause loss of reachability of the radio clock servers. Just saying this "is not a goal of DTS" with the "time-of-day market" irrelevancy thrown in does not make this reasonable. This from Dave Mills, followed by Joe Comuzzi, followed by Dave Mills: There does not appear to be a multicast function in which a personal workstation could obtain time simply by listening on the local wire without first obtaining a list of local servers. That is correct, it would violate the principle that a message exchange has to happen in order to correctly assign an inaccuracy. There appears to be a considerable Internet constituency which has noisily articulated the need for a multicast function when the number of clients on the wire climbs to the hundreds. Having responded to the articulation noise, I thought it might be a reasonable idea to include this capability (so far untested) on LANs with casual workstations, promiscuous servers and simple protocol stacks. This capability is certainly not untested. My NTP daemon does multicasting, and I synchronize about 80 machines here this way (I also note that 8/9ths of the computers which make up the NSFnet backbone are also time synchronized with broadcast NTP. I have no idea how many other users there are). 
The clients which are synchronized this way require no configuration and, indeed, the scheme scales well since I could synchronize 100's of machines this way at no additional expense. The NTP clock filter is adaptable for use with one-way time, the selection algorithms continue to work as always, and systematic time skews between machines with precise clocks are on the order of a few milliseconds (and stably so. This is NTP, after all). Truthfully, while I like the DTS approach to clock combination, I'm not sure I care about knowing the inaccuracy enough to miss knowing it on hosts at the bottom of the synchronization tree. Multicast time seems to me to be a feature the "time-of-day market" could truly make good use of.

The DTS time is a signed 64-bit count of 100-nanosecond units since Oct 15, 1582. It will not run out until after the year 30,000 AD, unlike NTP, which will run out in 2036. I, for one, intend to still be alive in 2036!

Note that this comment is only relevant if one requires that time stamps for long-lived things like files be identical to timestamps used by the time synchronization protocol. I see no reason to require this; about the only thing you save is a couple of conversion routines. If this isn't a requirement then the only thing NTP has to worry about are packets which spend more than 136 years in transit, and that there be some external time source which allows you to determine the time-of-day to within 68 years before running the protocol. I can't get too excited about this.

2) Since we include the transmission delay in the inaccuracy, 100 ns represents only 30 meters. It's not meaningful to talk about synchronizing clocks below that level with our algorithm. (I believe it's not meaningful to talk about synchronizing clocks below that level with NTP either.)

I have some 68000-based hardware with microsecond system clocks which have an on-time second output I can look at with an oscilloscope, or plot on a chart recorder. The clocks have ovenized crystal oscillators. I see synchronization between the machines, via NTP across an ethernet, on the order of 50-75 us (after calibrating out systematic asymmetric delays in the code). This is hardly trying, either. I don't think 100 ns is particularly small. GPS can give me UTC more precisely than that; I don't think it unreasonable to expect time synchronization protocols to be prepared to move time with precisions I can get today, let alone 20 years from now.

From Dave Mills, followed by Joe Comuzzi:

Both NTP and DTS have to do a little dance in order to account for timing errors due to the precisions of the local clocks and the frequency offsets (usually minor) over the transaction interval itself.

However, this is not at all hard to compute. Simply increase the inaccuracy by the potential drift of the local clock during the transaction. The architecture specifies this.

What bugs me about this is that increasing the inaccuracy by the potential drift (it appears to me the latter must be configured as well, since DTS doesn't seem to include any machinery to determine it on the fly. DTS may not be quite as "manageable" as one might believe) may make the protocol work nicely, but doesn't do a damn thing for the precision of my drifting system clock. The latter is what I'm paying for a time synchronization protocol for; the fact that DTS can tell me it is doing a bad job by giving me error bounds doesn't excuse the fact that it is doing a bad job. It appears to me that the NTP local clock code is a natural for DTS.
It avoids having to configure an expected drift to get a realistic estimate. It essentially reduces the drift by more than an order of magnitude. Since the local clock algorithm is analytically well defined it should be quite possible to have it produce a meaningful inaccuracy for use by the protocol automatically (the compliance estimate is close already). The inclusion of the local clock conditioning code would affect little else that I can see. Why didn't DTS include it?

A point which is not discussed here is that when NTP chooses to prune an entry, it cannot determine whether this entry's problem is that it comes from a bad clock (falseticker in your jargon) or that it experienced unusually large and asymmetric network delays. The latter case is something to be expected in normal operation; the former represents a problem which should be fixed. DTS uses the interval information to identify such bad clocks and reports them, since if a clock's interval doesn't intersect the majority it is clearly faulty. This is, of course, a MAJOR issue in distributed system management.

The possibility of separating broken clocks from broken networks is a neat feature of DTS' approach, and one much to be desired. I covet this ability. But why, oh why, was local clock conditioning ignored and the accuracy of the system clock compromised? I covet the latter more intensely, and can't understand why I can't have both.

I have never read a clear specification of the required quality of the input time to NTP. However, the following argument shows that in a LAN of typical machines, DTS can indeed provide time to NTP. The clock resolution of most machines is between 1 and 16.7 milliseconds. Thus, any single measurement made by NTP MUST experience this clock jitter. NTP can achieve better overall results only by averaging many such measurements. We have measured the 'jitter' of DTS times in LANs; it is less than 10 milliseconds, so if DTS supplies time to NTP in a typical LAN, NTP will receive time similar in quality to the time it gets from other NTP servers. In the WAN case, the jitter may be a problem; I assume that interoperating in the presence of WAN links may require clock training.

It is incorrect that the NTP local clock must experience a jitter of the magnitude of the precision of the remote machine's clock, since this data is passed through the filter algorithm before it reaches the clock. What mystifies me is where the 10 milliseconds comes from and how this is typical. I am on thin ice here, but let me expose my ignorance by making some assertions based on what I can decode from the DTS spec. DTS seems to be designed to deal with clock drifts in the 100 ppm ballpark. It appears to me that the clock is updated once in 15 minutes (??). Without clock conditioning, this would imply that a DTS-synchronized clock may jitter by 90 ms over the update period, and that on average the clock will be 45 ms off (this may be incorrect, but I see nothing at all in there which compensates for drift with predictive adjustments). This is what is alarming; an NTP-corrected clock likely won't drift by 90 ms if left without synchronization for 2 or 3 days, and under normal operation when synchronized across a LAN will on average be right dead on (or, at least, show systematic offsets in the sub-milliseconds which are more related to code path lengths and such). There is no reason why DTS couldn't match this performance.
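[Ed. note] A quick back-of-envelope check of the 90 ms / 45 ms figures above, in C. The 100 ppm drift and 15-minute update period are the assumptions made in the preceding paragraph, not numbers taken from the DTS specification.

    #include <stdio.h>

    int main(void)
    {
        double drift  = 100e-6;       /* assumed hardware drift, 100 ppm */
        double period = 15.0 * 60.0;  /* assumed update period, seconds  */

        /* With no frequency compensation the offset grows linearly
         * between updates, giving a sawtooth waveform. */
        double peak = drift * period;
        printf("peak offset    = %.0f ms\n", peak * 1e3);        /* 90 ms */
        printf("average offset = %.0f ms\n", peak / 2.0 * 1e3);  /* 45 ms */
        return 0;
    }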
Now, either your 10 ms implies that the "typical" clock you tested with had an inherent drift of less than 10 ppm, or that I am grossly mistaken and the clock update interval is more like a minute-and-a-half (implying a lot of traffic?). If the latter is true, I apologize. If the former is true, however, I would suggest you may be in for a big shock when you try to run your protocol in the "real world". From data taken from about 275 machines which run NTP here, we see an *average* drift of 30 ppm slow, with quite a large standard deviation. Fully 5% of those machines have drift rates greater than 100 ppm (I note that none of these are built by DEC, which may be why DTS has a more optimistic view of the world). There are six workstations in the room next door which drift at a rate of 300-350 ppm fast. NTP handles all of these (though it was helped with the 300 ppm stations by some priming), and the local clock effectively reduces the drift rate to a ppm or less in almost all cases. DTS leaves this part grossly underengineered.

It is an uncontested fact that computer systems can be badly disrupted should apparent time appear to warp (jump) backwards, rather than always tick forward as our universe requires.

I believe the difference also comes from different assumptions of the risks (and probabilistic costs) involved in jumping the clock. We assume it is something you want to do rarely.

Both Dave and Joe miss a fundamental point here. There is nothing in the NTP spec which requires the system clock (as opposed to NTP's local clock) to step into a +-128 ms window, or a +-512 ms window, or into a 15-minute window for that matter. NTP need never step the system clock backwards. There is a compilation option to my daemon which causes it to slew the system clock under all conditions, as this closes a hole when used with Kerberos. The performance when done this way is very nearly identical to the performance when the system clock is allowed to step. NTP's stepping of the system clock is an absolute non-issue; it can do it any way you prefer.

A succinct way of putting it might be to say that NTP attempts to deliver the most accurate, stable and reliable time according to statistical principles, while DTS attempts to deliver validated time according to correctness principles, but possibly at the expense of accuracy and stability.

I would claim you're understating DTS's goals of autoconfigurability and manageability.

I would claim that all of the above is misleading. (1) There is no reason that I can see why DTS' accuracy and stability couldn't be improved to NTP levels without violating correctness principles. Indeed, such an enhancement could only improve DTS' error bound estimate since it seems to me the local clock could be used to automatically produce estimates of things that need to be configured now. (2) NTP's accuracy and stability in no way preclude additions to the protocol to ease configuration and management. The latter just hasn't been done yet. (3) DTS' autoconfigurability and manageability have nothing to do with its ability to achieve NTP's level of performance, or lack thereof. The computational machinery required to do the latter was omitted for no good reason that I can see.

The ever-decreasing cost of time providers argues heavily for a simple solution, even though it may require more time providers. It simply isn't worth a lot of software complexity (and maintenance cost, and management cost) to avoid spending a few dollars to buy more providers.
Further, the philosophy of 'correctness' leads to certifiable implementation by independent vendors.

I continue to believe it is not constructive to "certify correctness" in probabilistic systems, only to exchange acceptable tolerance bounds for acceptable error bounds. If by "time providers" you imply each is associated with a radio clock, I do not think it likely that the cost of a radio clock will plummet to the point that every LAN can afford one and, even if it did, you cannot trust a single radio. You have to have more than one of them and, preferably, no common point of failure between them.

I find myself in agreement, and disagreement, with both of these points of view. I am personally a believer that every LAN should have a "time provider" or three, and that the only thing which prevents this from happening is a chicken-and-egg problem (radio clocks are low-volume items and hence are expensive. Radio clocks are expensive, so not a lot of people want to buy them). Again and again, however, the issue of the maintenance and management cost versus performance tradeoff rears its ugly head. *There* *is* *no* *such* *tradeoff*; the issues are orthogonal. Moreover, since the local clock processing is utterly divorced from the rest of the protocol, and since NTP's local clock is far and away the part of the spec most solidly supported by analysis, I can see no reason whatsoever that its inclusion in DTS would affect one's ability to produce certifiable implementations in any way.

...

The widespread deployment of NTP in the Internet seems to confirm that distributed Internet applications can expect that reliable, synchronized time can be maintained to within about two orders of magnitude less than the overall roundtrip delay to the root of the synchronization subnet. For most places in the Internet today that means overall network time can be confidently maintained to a few tens of milliseconds [MIL90a]. While the behavior of large-scale deployment of DTS in internet environments is unknown, it is unlikely that it can provide comparable performance in its present form. With respect to the future refinement of DTS, should this be considered, it is inevitable that the same performance obstacles and implementation choices found by NTP will be found by DTS as well.

I disagree with this final paragraph. I think that NTP and DTS both attain their very different goals. Our difference of opinion is in how important the different goals are. I accept that DTS will not keep clocks quite as tightly synchronized as NTP. It will, however, be a product that a vendor can confidently ship to customers who are expected to install, configure and manage it themselves.

Again we see the implication that a time protocol which is configurable and manageable cannot be precise. It is not that DTS cannot be precise, it is that it is not precise. I still have yet to see even one clear, understandable advantage which has been gained by not making DTS precise.

If I had to choose, I'd send both protocols back to the drawing board for further revision. NTP badly needs some work done in the area of auto-configuring large synchronization subnets, since this can be painful. DTS compromises its precision by ignoring relatively simple, straightforward, analytically sound techniques for improving the behaviour of the local clock, a deficiency which buys it nothing in other areas that I can see. DTS' "correctness" philosophy is truly attractive to me. If we were comparing paper protocols I would rate DTS a hands-down winner.
The thing is, "correctness" carries less weight with me in the comparison to NTP since NTP is a protocol derived from long practical experience and which is known to work well in the real world. Unless I am misunderstanding something (a distinct possibility), the "10 ms typical" problem may indicate that DTS' idea of what the world is like might not be based on wide experience. I like DTS, though. I just wish the spurious omission of computational machinery to deal properly with the local clock were fixed before international standardhood forces sloppy timekeeping (or, at least, a lot sloppier than it has any good reason to be) on us all. The only other concern I had was related to authentication. I just wanted to make sure either that DTS was only being targetted for the OSI environment or that it was carrying over all the related authentication baggage required into the Internet environment. Given the dependence of other Internet protocols' authentication schemes on the security (and synchronization, even) of the system clock, I think an Internet time protocol which lacks authentication will be unusable in a rapidly growing number of situations (of course, NTP could really use some help in the area of key management, regardless). ------------------------------------------------------------------------ Date: Wed, 21 Mar 90 14:10:08 PST From: comuzzi@took.enet.dec.com To: mills@udel.edu Subject: I've sent this to everyone else, yours bounced because of a typo. This is a note to continue the DTS/NTP comparison, because I too am finding this conversation fruitful. Dennis, I would be glad to mail you a copy of the DTS architecture if you want one. (and Dave, I've even changed the cover and introduction). I'd like to address some of the issues Dennis raised. I'll save his major point, about accuracy and ease-of-managment being orthogonal, for last. Allow me to start with the decision of DTS not to support a multicast mode. One reason was that protocols which multicast the time will be subject to Byzantine failures. (Clearly anyone can just multicast any time they want. This problem does not occur with using multicast to locate the servers in the DTS architecure, multiple servers would have to be co-opted.) The second point, the one I was trying to raise in my reply to Dave, was that it was DTS's intention to have the time and interval information available on every node. The hope was that this would permit the proliferation of applications and algorithms which used the DTS guarantee (that UTC is contained in the interval). Clearly, until such a facility is available on every system, few are going to spend a lot of time exploring such issues. One such application is in DECnet/OSI phase V management. Events are logged with their time and inaccuracy. This permits a network admistrator to examine log entries and determine that one event could not have caused a second event (if, for instance, the interval of the first event doesn't overlap and is later than the second.) Obviously this requires the inaccuracy to be available on every source of events, that is, every node. The third point I'd like to discuss refers to Dave's statement about a "considerable Internet constituency which has noisily articulated the need for a multicast function when the number of clients on the wire climbs to the hundreds." Is it that they wanted multicast? 
Or was the real objection the practical difficulty of adding a second server to a LAN of 300 nodes and then having to change the server entry in 150 ntp.conf files to redistribute the load? (This is a concrete example of why I claim NTP is a system administrator's nightmare. However, my real interest in this discussion is to separate the problem (autoconfiguration) from the particular solution that was chosen (multicast the time), and to understand what the motivation of the Internet community was.) Clearly DTS responds to the autoconfiguration problem, but if the requirement really was that they didn't want to add that additional server at all then a multicast scheme is probably the only solution. However, one should be clear about what is being traded off: a multicast solution will manifestly have Byzantine failures. I think this difference is an interesting one and would like to have more discussions about it.

The next topic I'll discuss is the area of the DTS architecture I personally believe is least likely to survive the standardization process unchanged: the timestamp format. I think we have much agreement here. I defended the 100 nanosecond resolution of the DTS timestamp, though I'm not particularly happy with it either. I do think it would be a win if the network timestamp format equaled the internal timestamp format, which argues that the network time format wants to be long-lived. Even if this argument is rejected, however, there is still a price to be paid when the network timestamp runs out - implementations which haven't added the rollover code will be unable to operate. The question here is how low-level, pervasive and long-lived network nodes might become (I've heard suggestions of putting the OSI protocol stack in a thermostat; I suppose it could happen). It's true that sitting here today, it's hard to imagine that any code I write will still be executing in 2036, but if I were blasting it into ROM I'd be less sure about that. I guess I'm also attracted to Dave's suggestion of using Julian Day numbering and fractions within the Julian Day, though when I try to plug in the numbers and actually design the timestamp I'm forced to conclude the total timestamp would have to be a bit bigger. Since I've already heard grumbling about the size of DTS's 128-bit timestamp, I guess I'll have to think about this further.

There has been a fair amount of discussion about interoperation of the two time services. Let me try to clarify what I said (too tersely) in my original response to Dave's paper. There are three separate cases I'd like to distinguish:

A) An isolated group of DTS systems which obtain time from NTP.
B) An isolated group of NTP systems which obtain time from DTS.
C) A collection of DTS and NTP systems giving time to each other.

I believe that without fairly major changes in one or both of the architectures case C is a problem; however, it is an easily prevented problem (see below). Cases A and B however are very interesting, useful and, I claim, easily achievable. NTP giving time to DTS would make sense in environments where DTS systems are being added to LANs where there is no local time provider and there is a pre-existing NTP infrastructure. The model for how to do this would be to use the DTS time provider interface to import the NTP time. Comparison of DTS timestamps within the DTS group would work normally; there would, however, be some risk of getting an incorrect result if timestamps of two such groups were intercompared.
This risk can be managed, though; the interoperation would require the user to specify an inaccuracy for the NTP time based on the local experience with NTP. The obvious choice would be some multiple of the synchronization distance - the more risk averse, the larger the multiple. This would (at low probability) violate the DTS correctness philosophy. If you want certainty, buy two hardware time providers per LAN. Note, however, a reasonable compromise would be one hardware time provider and an NTP time provider. The DTS fault detection would check the hardware against NTP and complain when there was a discrepancy.

Case B corresponds to situations where a DTS infrastructure exists (or is being added) and there is a need to deliver time to systems using NTP. This is the situation I was talking about in my response to Dave. I made several assumptions, which I didn't make clear. I assumed that the gateway would be both a DTS server and an NTP server, that the NTP code would be operating in the "don't change the clock" mode (in the U of Maryland implementation this is the -s option; I'm told Dennis's implementation has an option in the ntp.conf file to do the same thing) and that the NTP clients were on the same LAN. DTS servers synchronize with each other at two-minute intervals (note this is server to server - clients synchronize at 15-minute intervals; this answers Dennis's question). Now, I'm not claiming that "the NTP local clock must experience a jitter of the magnitude of the precision of the remote machine's clock"; what I am claiming is that in normal NTP operation the input data to the NTP filtering algorithm must experience a jitter of the magnitude of the precision of the remote machine's clock. This is the same magnitude as the jitter of the DTS-managed clock. I have conducted experiments with this configuration and haven't experienced any wild instabilities. Again, NTP timestamps generated in two such groups will not be intercomparable with each other to the level that they would be if the time were being delivered by NTP all the way. Currently, however, no application can be algorithmically dependent on distributed time derived from NTP unless the algorithm contains a parameter which says "times closer together than this will be assumed unordered", that is, unless the algorithm imputes an inaccuracy to the NTP time.

Case C potentially breaks the invariants of both protocols. The DTS invariant is that UTC is contained within the interval. The NTP invariant (I'm less sure of my statement here) is that the frequency of good servers agrees with UTC. NTP has a further invariant, that there are no loops in the time distribution network. This is enforced by the stratum. Clearly if DTS took time from a collection of NTP servers and later gave it back to the same collection of servers, a loop could and probably would occur. There is a simple method to prevent this: I propose that the gateway described in case B above always declare itself to be at some fairly high (numerically large) stratum. Potential clients will ignore the DTS/NTP server in favor of servers which obtained their time exclusively via NTP (and have much lower stratum numbers). I'm assuming that the NTP implementation at the gateway can be coerced into using a fixed stratum and would propose a value of 16 for this purpose. There's also a stratum zero which is supposed to be used when the stratum is unknown; however, I'm not sure I understand what value servers which obtain their time from a stratum zero server will use. Do they use zero?
If so, how are loops prevented amongst themselves?

A few nits before we get to the meat of the discussion. Dennis is concerned that the drift has to be input as a management parameter. I'll show my vendor colors here and say this isn't a management parameter but an implementation detail. The assumption is that when DTS is shipped to you from your hardware or software vendor, the drift has been autoconfigured. That is, a good implementation of the DTS architecture would know the maximum drift rate for the machines that the implementation has been certified on. Until some sort of architecture-neutral distribution format (OSF is attempting to settle on such an ANDF) is created this is really easy to do - just hard-code in the value that's appropriate for the given processor family. If there's wide variation within the processor family (or if there are multiple processor families due to an ANDF), you'll have to code a table. About the only case where the user would have to enter it is where he's created his own custom hardware configuration. I guess that doesn't bother me.

The second nit has to do with DTS's treatment of leap seconds. I appear not to have been clear here. Dave's original document was basically correct in its description of how DTS handles leap seconds - servers increase their inaccuracy at the month boundary and a time provider narrows the interval later. When I wrote: "Each server has to maintain and propagate this state before the leap insertion. This is, of course, subject to Byzantine failures. A failing server can insert a bad notification." I was describing my understanding of (and a problem with) NTP's leap second handling. If my understanding of NTP is incorrect, I apologize, but the Byzantine problem seems real to me.

In my reply to Dave I stated "DTS will not claim to have synchronized clocks to a level which it has not, even in the presence of asymmetric delays. NTP can and has." Dave correctly points out that "NTP does not claim to have synchronized to any level, only to minimize the level of probabilistic uncertainty and estimate the error incurred." I retract the last sentence. What I was trying to say is that DTS makes it clear to the user that he could be losing in this way, while NTP does not make it clear.

Dave asked if I am in substantial agreement with the statistical models presented in his first document. I agree with most of this section. My only significant disagreement is with the last paragraph. It is true that DTS assumes that a system's clock drifts at a rate less than its manufacturer's specification, and that a hardware time provider operates within specification. The probabilities of these assumptions being false are on the order of magnitude of other hardware failures. Software implementations do not routinely checksum memory any more (and they certainly don't do it to find memory errors). Violations of these assumptions represent faults, just as real as processor faults, and should be fixed. Note the long tails you observe in the distributions in the Internet are on message transmission times and the like. These parameters are dynamically measured in the DTS algorithm. Wick Nichols stated: "DTS is willing to accept historical estimates of the probability that a clock will go faulty (with checks for faultiness), but is not willing to accept historical estimates of current network characteristics." in his discussion of this point for the OSF.

Dennis asked a question about DTS authentication in the Internet environment.
What I personally would like to see is an implementation of DTS using Apollo's NCS which in turn used Kerberos authentication. This is basically what Digital has proposed to the OSF in response to their distributed computing request for technology.

Now to the major contention of Dennis's review, that accuracy and ease-of-management are "completely and utterly orthogonal". I disagree with this less than a reader of my response to Dave might think, though I am somewhat in disagreement with it. What I hold is that ease-of-management, provability and accuracy for a time service are all interrelated. I believe that ease-of-management is something that must be engineered into a system from the start; it can't be tacked on as an afterthought. Further, I believe simple systems are easier to manage than complex systems. The failure modes that Murphy can find in complex systems are just so much more (for lack of a better word) complex. Now, much of DTS's ease-of-management derives from its autoconfiguration; the autoconfiguration is, in turn, dependent on the (relative lack of) configuration rules, that is, that synchronizing any two servers just works and cannot lead to instabilities. The problem with just adding the NTP local clock model to DTS (as I understand the NTP local clock model) is that the resulting system could have wild instabilities. (Maybe my understanding of NTP is incorrect here.) The dynamic nature of the DTS autoconfiguration rules (couriers choosing a random member of the global set, for instance) means that the input time driving the local clock model will have what Dave calls "phase noise". As I understand NTP's local clock model this is where the instability creeps in. Further, the existing NTP protocol avoids loops by using a stratum concept; again, the DTS autoconfiguration happily produces loops. As I noted previously this doesn't affect the DTS algorithm, but they would cause havoc for NTP. Again one could add complexity to the DTS algorithm to prevent the loops, but I claim one would pay a price in system management cost. Another problem (according to Dave) is that the resultant phase-locked loops have to be analyzed in the light of assumed probability distributions, etc., and one does not end up with the sort of proofs of correctness that are what is liked in DTS.

There is one interesting aside on this last point. I believe there is a way one could add clock training to the DTS model and preserve the correctness. If the training algorithm decides to change the rate of the clock by some amount, *increase* the maximum drift rate by that amount. I believe this can be shown provably correct by the techniques in Marzullo's thesis. However, while this improves the precision of the time (the intersystem phase differences and rate differences will be smaller) the inaccuracy (the guarantee given to the user) will be worse! That DTS has chosen not to do this is, of course, the basic philosophical difference about what's important showing up again. However, the existence of at least one method to incorporate clock training into a provable system gives hope that both camps can be satisfied and in particular that the large body of work on the NTP local clock model can be incorporated. I am not (yet) expert enough in the NTP local clock model to see my way through this.

The other obvious possibility is to just add the autoconfiguration to NTP. To an extent, this is occurring. The multicast functionality clearly addresses the ease-of-management issue.
However, for NTP servers, I claim that choosing the right server is important enough that it can't be left to an algorithm. Switching at random between servers reintroduces the clock-hopping problem (the extra phase noise produced by the clock-hopping will cause problems for NTP). One could attempt to just pick a set of servers at random and stick with them for some long time to reduce clock hopping, but that will produce serious sub-optimality in the case of a changing network configuration (the particular servers being synchronized with might become cut off from their good time sources, or the paths to them might involve links which become overloaded, and this wouldn't be discovered for a long time). My desire is a solution which maintains provability and doesn't require a large tradeoff between autoconfiguration and accuracy. So far the only improvement I can make in the DTS architecture forces me to give up one measure of accuracy to improve another. More work needs to be done here.

------------------------------------------------------------------------

Date: Sat, 24 Mar 90 19:14:56 EST
From: Dennis Ferguson
To: comuzzi@took.dec.com
Cc: Mills@udel.edu, elb@mwunix.mitre.org, marcus@osf.org
Subject: Re: More discussion of the differences (and similarities) of DTS and NTP.

I realize that, while the NTP local clock processing is not a particularly difficult coding exercise, it rests on a foundation which is the subject of lots of textbooks and quite a number of academic journals. I also realize the presentation in Appendix F of the current NTP document certainly does not derive this stuff from first principles, since that would require turning the NTP spec into another control theory textbook. I can understand the derivation in there (though I couldn't have produced it, and could verify it only with great difficulty) by virtue of a somewhat academic, traditional engineering background (Dave seems to have a very academic, traditional engineering background), but I realize the stuff may look more than a little opaque if no one has ever forced you to learn/use that stuff. It is, however, very worthwhile to get a handle on what is in there, at least functionally, because this can make orders of magnitude difference in the results you get from your time protocol.

Let me make a number of assertions, some of which I'm not going to be able to support here, but which can be discovered by looking very carefully at the local clock description. I think you will find that it is not what you think.

(1) The NTP local clock does for NTP (and potentially for DTS) what adjtime() does for DTS. It is essentially a procedure which is called with a time offset as an argument, and which does something to the system clock as a result.

(2) Like adjtime(), the NTP local clock is fully deterministic. There are no probability considerations here. When you give it a time offset, the effect this has on the system clock is fully predictable. Hence the NTP local clock can have no effect on your ability to maintain correctness, any more than the behaviour of adjtime() has any effect on your ability to maintain correctness. The NTP local clock does what it is told, no more and no less.

(3) I think the specified NTP local clock is (should be, I have to trust Dave's math for this) unconditionally stable for all input. Note that "stable" in this context has a very specific meaning, and may not be what you expect.
If the NTP local clock wasn't unconditionally stable for all input with its current parameters, it could be made that way by adjusting those parameters. The stability of the local clock is predictable.

The local clock, in both NTP and DTS, takes time offsets from UTC as input and attempts to adjust the system clock in response to these, in principle to make the offsets smaller. Note that this is a feedback loop: the adjustments which are made to the system clock affect the next offset which is input to the local clock.

Now suppose a DTS client has a clock which drifts by 100 ppm. Suppose it also manages to obtain offsets every 15 minutes, by exchanges with its servers, which represent the difference between the client's system clock and UTC accurately to the nanosecond. The client gives an offset to the DTS local clock, essentially adjtime(), which slews the clock by the amount of the offset, and then sits around waiting for the next offset to arrive. Of course, while adjtime() is slewing the clock towards 0 offset, the clock's drift is slewing it away at a rate of 90 ms per 15 minutes. At the end of fifteen minutes the whole thing will be repeated. Also, as adjtime() will typically slew the clock at a much greater rate than 100 ppm, the system clock offset from UTC will show a sawtooth waveform varying between 0 and 90 ms with a period of 15 minutes, and will average 45 ms off. So, for absolutely perfect data in, the DTS local clock can get the system clock no closer than 45+-45 ms. This is inaccurate (it isn't even meaningful to say that you are getting perfect data in, either, since the jitter will be fed back as an uncertainty in the input). Indeed, this inaccuracy is not unexpected. This is what is called a Type I feedback loop (I think), and is known to be unable to track an input without introducing a phase error on the output. Worse, for perfect data in, consumers of time on that machine will see an uncertainty of +-90 ms, even though the true uncertainty is only half that (0 to 90 ms), and even though the input data is perfect. This is gross.

The NTP local clock does much, much better, by a couple of techniques. First, the jitter the DTS system clock experiences is essentially eliminated by not feeding an entire adjustment to the system clock all at once, but rather by making very small adjustments at frequent intervals (currently 4 seconds in the spec). The NTP local clock hence avoids introducing the sawtooth; the clock changes slowly and smoothly. Further, the NTP local clock corrects the average error by computing and applying a correction for the drift, by implementing a Type II feedback loop. Essentially, for perfect data in, the NTP local clock will eventually determine the drift of the system clock and, when it does, will maintain the average offset of the system clock at 0. I.e., perfectly accurate. A Type II feedback loop will track a fixed input accurately. By applying many little corrections to the system clock instead of one big one, NTP will also maintain the system clock relatively jitter-free (one could say absolutely jitter-free in comparison to DTS).

Of course, there is no such thing as perfect, but we can begin to assign relative orders of magnitude to errors of the DTS and NTP local clock schemes (this comparison is deterministic as well; there are no probabilities involved). DTS' local clock error is proportional to the drift.
I have grossly oversimplified this, but please believe that the details are all in the NTP spec if you decode them.

Now, how can replacing DTS' use of adjtime() with something which is more accurate affect DTS' configurability? How could it possibly affect correctness? I stand by my original statement, that the issue of accuracy (with respect to the local clock) is completely and utterly orthogonal to any issues of configuration or management. DTS just didn't include the machinery to condition the local clock (and apparently is hung up on Type I feedback loops for clock control when there are other, better ways to do it), and this is what is so frustrating about it.

I also think your claim that using local clock processing which does frequency compensation would require one to *increase* the inaccuracy must either be wrong, or indicate that DTS is doing something very silly. Does it make sense that something which can be analytically shown to increase accuracy increases the inaccuracy? Does it make sense that, if I tune a crystal for zero drift in hardware with a screwdriver the inaccuracy should decrease, while if I tune it for zero drift in software the inaccuracy must increase? The mind boggles.

Further, I think you may still be operating in an unreal world with the assertion that a manufacturer could possibly define a maximum drift for the clock in a particular model of machine, one that could never be exceeded under any circumstances which couldn't be called hardware failure. I would suggest to you that the very worst cause of clock drift in real systems is not hardware at all, but rather lost clock interrupts (which cause large, negative drifts). Would you be willing to certify that DEC hardware/operating systems never lose clock interrupts under any circumstances? Or provide a guaranteed limit to the number that will be lost in any time interval? You can call lost clock interrupts a "fault" if you wish, but implying that such "faults" are "historically" a rare occurrence flies in the face of experience, at least with Unix systems.

Which brings up another assertion I am beginning to doubt, that NTP does not (or cannot) know the inaccuracy of its time estimate. Joe, take a look at how NTP's synchronization distance is accumulated. Don't worry about the value of this that the spec suggests be used for stratum 1 servers; let's assume that this is configured for stratum 1 servers in a way which agrees with DTS. Look carefully at the synchronization distance a stratum 3 server will receive. Do you see any reason why I could not assert that UTC must be contained in an interval which is +- 1/2 the synchronization distance from the system clock's time, and prove this assertion by the same principles that DTS uses? Or, if not, that there are any uncorrectable imperfections in this assertion? I am hence beginning to think, as Dave does, that NTP's time could be proven to lie within known bounds, just as DTS' can. I see nothing that would prevent this. If applications needed to know this, I think it could be arranged for NTP to provide it.
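As a sketch of the sort of bookkeeping this involves, suppose each hop toward the stratum 1 server contributes its round-trip delay and a dispersion term to the accumulated distance (a plausible reading of the accumulation, not the spec's exact formula; the numbers are made up):

    def sync_distance(hops):
        """hops: (round_trip_delay, dispersion) per subnet hop, stratum 1 first."""
        total = 0.0
        for delay, dispersion in hops:
            total += delay + dispersion  # grows monotonically with each hop
        return total

    # a stratum 3 server reached via one intermediate server:
    dist = sync_distance([(0.030, 0.005), (0.010, 0.002)])
    # the claim under discussion: UTC lies within +- half the distance
    print("claim: UTC within +-%.4f s of system time" % (dist / 2.0))

Because the accumulated distance only grows as you move away from the stratum 1 server, the resulting bound can be propagated to clients the same way DTS propagates its inaccuracy.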
The question becomes, then, what is all that statistical junk that NTP does?

I think the issue that is being missed here (and I'm on thin ice again) is that DTS does indeed make some assumptions about probability distributions. In particular, all correctness can give you is an interval which should include UTC. The system clock, however, cannot be set to an interval; it needs a specific value. DTS hence arbitrarily assumes that any time in the interval is as likely to be UTC as any other, and hence picks the middle of the interval, as this minimizes the probable error based on that assumption (the fact that not picking the middle of the interval increases the inaccuracy interval demonstrates a flaw in the DTS protocol which is also exercised by the local clock processing. The protocol demands that the intervals be +- something even when it might be known that the true interval is +something -something-different. DTS hence claims inaccuracy intervals which are often bigger than they should be simply because it lacks the ability to represent the true state of affairs, as the sketch below illustrates. This doesn't affect correctness, but reduces the utility of knowing the inaccuracy interval).

NTP, however, assumes that the probability distribution over the interval is non-uniform, that some times within the interval are more likely to be UTC than others (this isn't strictly true, but I see no reason why it couldn't be). It proceeds to determine the time within the interval which is most likely to be UTC and sets the clock to that. NTP does this in part by casting off samples and servers which it thinks are less reliable. If DTS can correctly synchronize to a single server, however, then casting off servers you aren't fond of can't affect correctness. NTP chooses servers it likes (and samples from servers it likes) based on presumed characteristics of the probability distributions of network traffic. This has no effect on correctness, however, and in the extremely unlikely event that the network does not behave in the way that NTP expects, NTP's choice of UTC will probably be no worse than DTS'. This has no effect on NTP's ability (or lack thereof) to produce correct bounds on the estimate, in DTS' sense. Also, I find it strange that DTS could claim that avoiding having to "accept historical estimates of current network performance" is a feature, when this is done by making assumptions about probability distributions which have no basis in practical reality at all.

A couple of things come to mind. Joe, you mentioned something about "wild instabilities" which Dave said NTP might suffer in the face of poor server selection, or some such. I would suggest to you that the term "wild instabilities" is one which is relevant only in relation to one's expectation of the performance of your time protocol. To anyone who thinks an error of 45+-45 ms in the time the system clock returns when given perfect data is acceptable, NTP is as solid as a rock. NTP's "wild instabilities" are only relevant if your expectations are much higher than DTS', since "wild" for NTP is a lot smaller than the instabilities DTS apparently considers normal.

More than this, I fail to understand the rest of the arguments about why NTP couldn't be retrofitted with an autoconfiguration protocol. NTP has no configuration rules; it places itself in the hierarchy based on the servers available to it. NTP will operate in such a way as to maximize the probable accuracy of its time no matter how it is configured. Give your NTP daemon a random set of peers and it will choose the best of them, adopt a stratum which is appropriate based on the time sources available, and make the best use of the time available.
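Here is the representation point above in miniature. Suppose it is known that UTC is at most 10 ms behind and at most 40 ms ahead of the system clock (made-up numbers); a protocol restricted to a balanced +- interval about the clock must claim the larger one-sided bound:

    # Toy illustration: asymmetric bounds versus a forced +- representation.
    behind, ahead = 0.010, 0.040       # the true, asymmetric bounds

    true_halfwidth = (behind + ahead) / 2.0   # 25 ms, if recentred
    balanced_claim = max(behind, ahead)       # forced +- form: 40 ms

    print("asymmetric form: -%.3f / +%.3f s" % (behind, ahead))
    print("forced +- form : +-%.3f s (%.0f%% larger than necessary)"
          % (balanced_claim, 100 * (balanced_claim / true_halfwidth - 1)))

The claimed interval stays correct either way; it is only the usefulness of the figure that suffers.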
Concerns about "phase noise" (i.e. jitter) are again based on expectations of the performance of the time protocol; an NTP server which takes time from your 45+-45 ms host will survive just fine, and indeed will show far less jitter than +-45 ms (look at the local clock; high frequency noise is damped out). It is just that NTP servers are expected to be a lot closer than 45 ms, so your host looks bad compared to NTP's expectations (but not needs).

Further, the stuff about synchronization loops is irrelevant. The NTP protocol survives loops just fine; it's just that the consequence of a loop is that the machines involved count their stratum to infinity and disconnect from the synchronization subnet rather than continuing to fool themselves that their servers know something they don't. This is quite reasonable behaviour since NTP clocks don't drift much when left unsynchronized.

I would suggest to you that NTP knows far more about Murphy than DTS does at this point, since it has been tested in far more environments, on far more machines, in likely far harsher environments, through far more revisions, for far longer than DTS has. There are three independently done implementations of NTP, all of which work well and interoperate. How complex can NTP be? NTP can be given a random set of servers and work just fine, thanks. This wasn't done with auto-configuration specifically in mind, but rather simply to meet robustness requirements. NTP has a lot of real world experience to prove that it is robust, and that it is certainly robust in the face of even gross misconfiguration. What more is needed for an auto-configurable protocol? Autoconfiguration certainly couldn't do a worse job than people do.

As for Byzantine failures, you are right that NTP's scheme for leap second notifications suffers from this, but what is the worst thing that can happen if this occurs? Right, your clock ends up a second off. With DTS, however, it is a virtual certainty that the clock will end up a second off when a leap second occurs, so criticizing NTP for leaving this hole is a little like the pot calling the kettle black. That DTS increases its inaccuracy by a second is irrelevant for comparison, since NTP doesn't maintain this inaccuracy. If (when) NTP maintains an inaccuracy interval it should probably increase it by a second during leaps as well. This doesn't help keep your clock accurate across leaps, though.

For broadcast time, however, I think this is incorrect. The NTP clock selection code includes an agreement protocol, and this is still used for broadcast time. To cause a failure one would have to co-opt a majority of the servers, and this is hardly less robust than DTS. I can see little more exposure to such failures with broadcast time than with polled time, and I think we must agree that polled NTP is not less robust in the face of Byzantine failures than DTS. Further, if you are really worried about hostile attacks on your clients then you'll be using authentication anyway, in which case there is no additional exposure to such failures.

More than this, note that NTP's multicast time is used in the LAN environment. The transit delays here are a few milliseconds, and indeed my daemon includes partially implemented code to determine these delays on the fly by polling. This transit delay isn't even measurable by a lot of machines (between machines with 10 or 20 ms clocks you end up computing absurdities like negative round trip delays. Does DTS handle this?), so DTS isn't necessarily going to know a whole lot about this delay by polling anyway.
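The negative-delay absurdity is easy to reproduce with the usual four-timestamp round-trip calculation, here applied to hosts whose clocks tick only every 10 ms (all the specific numbers below are made up for illustration):

    # Sketch: offset/delay from the four timestamps of one exchange, with
    # 10 ms clock granularity and a true one-way transit time of only 2 ms.
    TICK = 0.010                              # 10 ms clock granularity

    def read(true_time, clock_offset=0.0):
        ticks = int((true_time + clock_offset) / TICK)
        return ticks * TICK                   # the host sees only whole ticks

    transit, turnaround, server_off = 0.002, 0.001, 0.007
    t1 = read(100.0005)                       # client transmits
    t2 = read(100.0005 + transit, server_off) # server receives
    t3 = read(100.0005 + transit + turnaround, server_off)  # server transmits
    t4 = read(100.0005 + 2 * transit + turnaround)          # client receives

    offset = ((t2 - t1) + (t3 - t4)) / 2.0    # estimated clock offset
    delay = (t4 - t1) - (t3 - t2)             # comes out at -10 ms here
    print("offset %+.3f s, delay %+.3f s" % (offset, delay))

With these tick phases the computed round-trip delay is -10 ms, even though every true transit is positive: the quantization noise completely swamps the quantity being measured.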
Now, you are willing to accept a 45+-45 ms error in the setting of your clock and a 90 ms inaccuracy, for perfect data in, due to the primitive local clock processing that DTS does, so why the heck not add in another, say, 50 ms or something equally outrageous, to the inaccuracy interval for the broadcast and forget about it? The chance of it ever exceeding 50 ms (or whatever) is of about the same order as, say, lost clock interrupts or a hardware failure ruining your assumptions about the maximum local oscillator drift. Call the big delay a network "fault" and forget about it.

The real advantage of multicast time is that you can update a large number of clients very frequently with next to no traffic. One minute updates can allow much greater precision under adverse conditions than, say, 15 minute updates in any case, yet the cost of serving 300 clients this way drops from hundreds of packets per minute to one packet per minute. You may not care about this, since DTS often seems to me to be little concerned with accuracy in any case, but some people like it. And note that you've already allowed probabilities to creep into your inaccuracy interval by assuming one can configure a maximum drift for any system (of course, you punt on this by calling violations of the assumption "faults"); it seems to me that assumptions concerning one way delays across LANs do not increase the probability of the inaccuracy being wrong (and, of course, you can always call such cases "faults").

One comment about NCS and Kerberos for authentication. Authentication is notorious for increasing code path lengths (possibly asymmetrically) and this affects accuracy adversely. This is in part why NTP includes an integral authentication protocol: because it is concerned with accuracy, and integrating the authentication allows it to optimally control the damage this does. Because NTP is concerned with accuracy (it is apparent that DTS is far less so) I can't ever see the internal crypto-checksum code being moved to an external agent (I would object if it was), since you can always do a better job, in the real time sense, with this incorporated as part of the protocol and coded by someone who is aware of the issues. NTP could use help with key management, though.

Actually, upon rereading this note I find (in addition to it being far longer than it should be) that I've taken on too much of a pro-NTP debating tone. I apologize in advance. I guess the part that galled me is the "not willing to accept historical estimates of current network performance" comment as an excuse to ignore NTP's well developed, well tested time keeping technology when designing the DTS protocol, when what has really been done is to replace NTP's observation-based assumptions about the probability distributions of network gunk with assumptions concerning probabilities which have no basis in the real world and which are made strictly to avoid exposing a defect in DTS' representation of the inaccuracy.
Dave's simulation results indicate that it is indeed the case that you've made DTS work worse in the real world than NTP (at keeping the clock correct, which is what I desire most from my time keeping protocol, not at computing correct error bounds, which I desire less anyway). NTP, with its current-reality based assumptions on network characteristics, couldn't do worse than DTS, whose assumptions seem based on nothing. Couple this with the stability and accuracy that NTP's local clock gives it (and the local clock is deterministic; the "historical estimates" fuzz can't even be used to justify ignoring this) and you've got a powerful, proven time keeping protocol. And DTS ignored all of it. Missing the local clock in particular is unforgivable.

NTP doesn't produce a "correct" inaccuracy, but there doesn't seem to me to be any reason which prevents it from doing so. The statistical assumptions it makes have no effect on this. If there are consumers who would like to know the inaccuracy (I would) I think NTP could provide it, perhaps without even changing packet format. And NTP doesn't have an autoconfiguration protocol, but it has wide experience with people-configuration and you couldn't possibly design an auto-configuration protocol which does worse than people.

I am sorry for the tone of this, but I can't help but take issue with the (apparent in DTS' design) attitude that nothing in NTP was worth looking at (from my perspective DTS' treatment of the local clock is horrendously primitive and simple-minded, for example). I understand from your last note, however, that you are maybe growing more sensitive to this. If we are to standardize a time protocol, let us make it a good one by not ignoring existing experience.

------------------------------------------------------------------------

24-MAR-1990 19:15:29.73
From: Michael Soha LKG1-2/A19 226-7658
To: comuzzi@took.dec.com
CC: Mills@udel.edu, elb@mwunix.mitre.org, marcus@osf.org
Subj: Re: More discussion of the differences (and similarities) of DTS and NTP.

I realize that, while the NTP local clock processing is not a particularly difficult coding exercise, it rests on a foundation which is the subject of lots of textbooks and quite a number of academic journals. I also realize the presentation in Appendix F of the current NTP document certainly does not derive this stuff from first principles, since that would require turning the NTP spec into another control theory textbook. I can understand the derivation in there (though I couldn't have produced it, and could verify it only with great difficulty) by virtue of a somewhat academic, traditional engineering background (Dave seems to have a very academic, traditional engineering background), but I realize the stuff may look more than a little opaque if no one has ever forced you to learn/use that stuff. It is, however, very worthwhile to get a handle on what is in there, at least functionally, because this can make orders of magnitude difference in the results you get from your time protocol.

If the author is referring to Control Theory I agree that it is the subject of many textbooks and journal articles. However, if he's alluding to the modeling of quartz oscillators and their stability I have to disagree. My references, [2-6], do not use the model employed by NTP. Refer to my initial discussion on training. As an aside, my NTP documentation, RFC 1119, does not have an Appendix F. I assume the author is referring to Section 5, 'Local Clocks' [Ref 1].
Speaking for myself, I took several courses in Control Theory, not so much because I was "forced", but simply because I enjoyed the subject.

Let me make a number of assertions, some of which I'm not going to be able to support here, but which can be discovered by looking very carefully at the local clock description. I think you will find that it is not what you think. (1) The NTP local clock does for NTP (and potentially for DTS) what adjtime() does for DTS. It is essentially a procedure which is called with a time offset as an argument, and which does something to the system clock as a result.

I agree. Both NTP and DTSS are time synchronization protocols.

(2) Like adjtime(), the NTP local clock is fully deterministic. There are no probability considerations here. When you give it a time offset, the effect this has on the system clock is fully predictable. Hence the NTP local clock can have no effect on your ability to maintain correctness, any more than the behaviour of adjtime() has any effect on your ability to maintain correctness. The NTP local clock does what it is told, no more and no less.

Deterministic systems can be wrong. I don't think Joe was trying to say that NTP is indeterministic (God doesn't play dice with NTP). I think what he was reacting to was the lack of a correctness proof and a definition for correctness. As discussed previously, I believe the NTP clock model is unsound.

(3) I think the specified NTP local clock is (should be, I have to trust Dave's math for this) unconditionally stable for all input. Note that "stable" in this context has a very specific meaning, and may not be what you expect. If the NTP local clock wasn't unconditionally stable for all input with its current parameters, it could be made that way by adjusting those parameters. The stability of the local clock is predictable.

Reading this paragraph I see the author saying NTP should be stable and if it isn't we can fix it. Referring to 'Modern Control Engineering' by Ogata [Ref 10], pg 7: "From the point of view of stability, the open loop control system is easier to build since stability is not a major problem. On the other hand, stability is always a major problem in the closed loop system since it may tend to overcorrect errors which cause oscillations of constant or changing amplitude". My interpretation of stability is just that: no oscillations. See Ogata, pg 217, 'Absolute stability, relative stability, and steady state error' for additional discussion. Furthermore, while we are on the topic of stability, one has to acknowledge that a Type II system is more apt to be unstable than a Type I system. Again from Ogata, pg 284: "As the type number is increased, accuracy is improved; however, increasing the type number aggravates the stability problem. A compromise between steady state accuracy and relative stability is always necessary". There are no easy answers in this world. Is the NTP clock model the proper choice in terms of accuracy and relative stability?

In summary, I agree with only one of Dennis' assertions: NTP and DTSS are designed to synchronize clocks in a computer network.

The local clock, in both NTP and DTS, takes time offsets from UTC as input and attempts to adjust the system clock in response to these, in principle to make the offsets smaller. Note that this is a feedback loop: the adjustments which are made to the system clock affect the next offset which is input to the local clock. Now suppose a DTS client has a clock which drifts by 100 ppm.
Suppose it also manages to obtain offsets every 15 minutes, by exchanges with its servers, which represent the difference between the client's system clock and UTC accurately to the nanosecond. The client gives an offset to the DTS local clock, essentially adjtime(), which slews the clock by the amount of the offset, and then sits around waiting for the next offset to arrive. Of course, while adjtime() is slewing the clock towards 0 offset, the clock's drift is slewing it away at a rate of 90 ms per 15 minutes. At the end of fifteen minutes the whole thing will be repeated. Also, as adjtime() will typically slew the clock at a much greater rate than 100 ppm, the system clock offset from UTC will show a sawtooth waveform varying between 0 and 90 ms with a period of 15 minutes, and will average 45 ms off. So, for absolutely perfect data in, the DTS local clock can get the system clock no closer than 45+-45 ms. This is inaccurate (it isn't even meaningful to say that you are getting perfect data in, either, since the jitter will be fed back as an uncertainty in the input). Indeed, this inaccuracy is not unexpected. This is what is called a Type I feedback loop (I think), and is known to be unable to track an input without introducing a phase error on the output. Worse, for perfect data in consumers of time on that machine will see an uncertainty of +-90 ms, even though the true uncertainty is only half that (0 to 90 ms), and even though the input data is perfect. This is gross.

I agree that if one has a very poor oscillator, it will drift on the order of 90 msec/15 minutes. However, before I apply the NTP clock model and expect to drive the stability to one part in 10**8, I'd first consider what is actually possible (i.e., technically sound). As another aside, the statement "...UTC accurately to the nanosecond" is meaningless for several reasons. First, the current state of the art for time transfer (using the Global Positioning System, GPS) can achieve a precision of a few nanoseconds [Ref 5, 8]. This precision is achieved through the use of cesium clocks, very precise position information, and long integration times. It is unreasonable, to say the least, to expect that one could achieve this precision in a computer network. Secondly, the definition of UTC (as given by the CCIR, Ref 9) states that '... UTC enables events to be determined with an uncertainty of 1 us'. If you are looking at time transfer with a precision on the order of nanoseconds, one must specify a timing center, e.g., UTC(NIST) (see Ref 11 for a description).

The term Type I feedback is a reference to the steady state error characteristics of a control system. Saying that a Type I system is unable to track an input without error is wrong (see Ref 10, pp. 283-292). A Type I system can track a step input with ZERO steady-state error. For a ramp input it will have a finite error, which can be reduced by tailoring the system. A Type II system can track a ramp input with zero steady-state error. As discussed in an earlier response, a proper solution is always a compromise between several parameters (e.g., stability, steady-state error, transient errors, overshoot, time to first zero, etc.). In other words, a Type II system isn't always better than a Type I.

The NTP local clock does much, much better, by a couple of techniques.
First, the jitter the DTS system clock experiences is essentially eliminated by not feeding an entire adjustment to the system clock all at once, but rather by making very small adjustments at frequent intervals (currently 4 seconds in the spec). The NTP local clock hence avoids introducing the sawtooth; the clock changes slowly and smoothly. Further, the NTP local clock corrects the average error by computing and applying a correction for the drift, by implementing a Type II feedback loop. Essentially, for perfect data in, the NTP local clock will eventually determine the drift of the system clock and, when it does, will maintain the average offset of the system clock at 0. I.e., perfectly accurate. A Type II feedback loop will track a fixed input accurately. By applying many little corrections to the system clock instead of one big one, NTP will also maintain the system clock relatively jitter free (one could say absolutely jitter free in comparison to DTS).

For perfect data in, the NTP local clock MODEL may be able to reduce the offset to zero. However, for the actual quartz oscillator it is impossible. There will always be short-term drift, as discussed in the beginning of this memo. As for Type II vs. Type I, it is naive to say that a Type II system is always better than a Type I. What is better? Less stable? More accurate? More transient errors?

Of course, there is no such thing as perfect, but we can begin to assign relative orders of magnitude to errors of the DTS and NTP local clock schemes (this comparison is deterministic as well, there are no probabilities involved). DTS' local clock error is proportional to the drift. NTP's local clock error is proportional to the rate of change of the drift with respect to time multiplied by the time constant of the PLL. In the real world the latter is less than the former by at least several orders of magnitude.

It is wrong to say that DTS' local clock error is proportional to the drift. The inaccuracy will grow at a rate equal to the maximum drift. The test data to date indicates that most VAX oscillators are much better than 1 part in 10**4, so the difference between two DTS local clocks will be much better than the max drift. So why use 1 part in 10**4? That number accounts for all expected drift components: initial offset; aging over the lifetime of the product, 10 years; and environment (temperature, humidity, supply voltage, etc.). I have a real problem with the quote 'several orders of magnitude'. I have seen in several discussions on NTP that it can achieve stabilities on the order of 1 part in 10**8. All of the data that I have seen on quartz oscillators [Ref 2-7] indicates that it is impossible for uncompensated oscillators to reach a stability of one part in 10**8.

I have grossly oversimplified this, but please believe that the details are all in the NTP spec if you decode them. Now, how can replacing DTS' use of adjtime() with something which is more accurate affect DTS' configurability? How could it possibly affect correctness? I stand by my original statement, that the issue of accuracy (with respect to the local clock) is completely and utterly orthogonal to any issues of configuration or management. DTS just didn't include the machinery to condition the local clock (and apparently is hung up on Type I feedback loops for clock control when there are other, better ways to do it), and this is what is so frustrating about it.
If one adds complexity to increase the accuracy of a system, then it may result in additional management and/or configuration burden. In fact I'd argue that this is true more often than not. Saying the two are completely and utterly orthogonal is overstating your case. Personally I like control theory, mainly because it is a challenge to do the right thing. The question of Type I vs. Type II is (for me at least) unrelated to why I don't like the NTP clock model. I simply believe that it is technically unsound to 'train' an uncompensated oscillator [Ref 2-6].

I also think your claim that using local clock processing which does frequency compensation would require one to *increase* the inaccuracy must either be wrong, or indicate that DTS is doing something very silly. Does it make sense that something which can be analytically shown to increase accuracy increases the inaccuracy? Does it make sense that, if I tune a crystal for zero drift in hardware with a screwdriver the inaccuracy should decrease, while if I tune it for zero drift in software the inaccuracy must increase? The mind boggles.

The reason you may want to consider increasing the inaccuracy is for robustness. That is, if the correction is correct 99% of the time, then for the 1% of the time that the correction is incorrect, you will still contain UTC in the time interval.

Further, I think you may still be operating in an unreal world with the assertion that a manufacturer could possibly define a maximum drift for the clock in a particular model of machine, one that could never be exceeded under any circumstances which couldn't be called hardware failure. I would suggest to you that the very worst cause of clock drift in real systems is not hardware at all, but rather lost clock interrupts (which cause large, negative drifts). Would you be willing to certify that DEC hardware/operating systems never lose clock interrupts under any circumstances? Or provide a guaranteed limit to the number that will be lost in any time interval? You can call lost clock interrupts a "fault" if you wish, but implying that such "faults" are "historically" a rare occurrence flies in the face of experience, at least with Unix systems.

Stating a maximum number for drift can be done if one accounts for all of the contributors to drift (short and long term). One only needs to examine the oscillator specification. The main contributors are: aging, initial accuracy and temperature stability. Accounting for these contributors one can easily show that a stability of 1 part in 10**4 is a proper choice (assuming a lifetime of 10 years). I agree that accounting for missing clock interrupts is a very difficult problem. Does NTP have a solution for this?

Which brings up another assertion I am beginning to doubt, that NTP does not (or cannot) know the inaccuracy of its time estimate. Joe, take a look at how NTP's synchronization distance is accumulated. Don't worry about the value of this that the spec suggests be used for stratum 1 servers; let's assume that this is configured for stratum 1 servers in a way which agrees with DTS. Look carefully at the synchronization distance a stratum 3 server will receive. Do you see any reason why I could not assert that UTC must be contained in an interval which is +-1/2 the synchronization distance from the system clock's time, and prove this assertion by the same principles that DTS uses? Or, if not, that there are any uncorrectable imperfections in this assertion?

I agree that NTP may be able to provide an inaccuracy. But you may need additional data to account for processing delays and the local clock resolution at each NTP node on the path from the stratum 1 server.
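To put numbers on the bound being discussed: between synchronizations a DTS-style inaccuracy simply grows at the nameplate maximum drift, here the 1 part in 10**4 figure defended above (a sketch with illustrative values, not DTS code):

    MAX_DRIFT = 1e-4                    # nameplate bound: 1 part in 10**4

    def inaccuracy(inacc_at_sync, seconds_since_sync):
        # correct only while the true drift stays under MAX_DRIFT; a
        # violation (lost clock interrupts, say) must be treated as a fault
        return inacc_at_sync + MAX_DRIFT * seconds_since_sync

    # 10 ms at the last exchange, one hour ago:
    print("%.3f s" % inaccuracy(0.010, 3600.0))   # -> 0.370 s

The bound is deterministic and provable, but note how fast it grows: after an hour without an exchange the guarantee has widened from 10 ms to 370 ms.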
The question becomes, then, what is all that statistical junk that NTP does? I think the issue that is being missed here (and I'm on thin ice again) is that DTS does indeed make some assumptions about probability distributions. In particular, all correctness can give you is an interval which should include UTC. The system clock, however, cannot be set to an interval; it needs a specific value. DTS hence arbitrarily assumes that any time in the interval is as likely to be UTC as any other, and hence picks the middle of the interval, as this minimizes the probable error based on that assumption (the fact that not picking the middle of the interval increases the inaccuracy interval demonstrates a flaw in the DTS protocol which is also exercised by the local clock processing. The protocol demands that the intervals be +-something even when it might be known that the true interval is +something -something-different. DTS hence claims inaccuracy intervals which are often bigger than they should be simply because it lacks the ability to represent the true state of affairs. This doesn't affect correctness, but reduces the utility of knowing the inaccuracy interval).

See my response [above] and my initial discussion on training. DTS is not optimal since it balances the inaccuracy about the midpoint. One need not use a balanced interval: one could provide three datapoints, time, +inacc, and -inacc; however, we decided that this optimization was beyond the point of diminishing returns.

NTP, however, assumes that the probability distribution over the interval is non-uniform, that some times within the interval are more likely to be UTC than others (this isn't strictly true, but I see no reason why it couldn't be). It proceeds to determine the time within the interval which is most likely to be UTC and sets the clock to that. NTP does this in part by casting off samples and servers which it thinks are less reliable. If DTS can correctly synchronize to a single server, however, then casting off servers you aren't fond of can't affect correctness. NTP chooses servers it likes (and samples from servers it likes) based on presumed characteristics of the probability distributions of network traffic. This has no effect on correctness, however, and in the extremely unlikely event that the network does not behave in the way that NTP expects, NTP's choice of UTC will probably be no worse than DTS'. This has no effect on NTP's ability (or lack thereof) to produce correct bounds on the estimate, in DTS' sense. Also, I find it strange that DTS could claim that avoiding having to "accept historical estimates of current network performance" is a feature, when this is done by making assumptions about probability distributions which have no basis in practical reality at all.

'Most likely to be UTC' describes why NTP is different from DTSS. NTP focuses on accuracy while DTSS' main goal is to always include UTC in the computed time interval. Neither one is necessarily better than the other. As described in Dennis' memo and my responses, NTP uses a different control mechanism than DTS (e.g., Type II vs. Type I). Given this, it is difficult to accept the claim that NTP's choice will be no worse than DTSS'. Two points. 1. Historical estimates of network performance may be stale due to changes in the network layer (different routes) and the datalink layer (changes in the bridge topology).
How does one ensure that the network performance estimates reflect the current state of affairs? 2. The statement that 'DTS's assumptions have no basis in reality' is false. See my earlier response and my initial discussion on reality.

A couple of things come to mind. Joe, you mentioned something about "wild instabilities" which Dave said NTP might suffer in the face of poor server selection, or some such. I would suggest to you that the term "wild instabilities" is one which is relevant only in relation to one's expectation of the performance of your time protocol. To anyone who thinks an error of 45+-45 ms in the time the system clock returns when given perfect data is acceptable, NTP is as solid as a rock. NTP's "wild instabilities" are only relevant if your expectations are much higher than DTS', since "wild" for NTP is a lot smaller than the instabilities DTS apparently considers normal.

I thought this comment originally came from Dave (wrt DTSS giving time to NTP). It may have been overstated; however, the point remains that it is more difficult to ensure stability in a Type II system (as compared to a Type I).

More than this, I fail to understand the rest of the arguments about why NTP couldn't be retrofitted with an autoconfiguration protocol. NTP has no configuration rules; it places itself in the hierarchy based on the servers available to it. NTP will operate in such a way as to maximize the probable accuracy of its time no matter how it is configured. Give your NTP daemon a random set of peers and it will choose the best of them, adopt a stratum which is appropriate based on the time sources available, and make the best use of the time available. Concerns about "phase noise" (i.e. jitter) are again based on expectations of the performance of the time protocol; an NTP server which takes time from your 45+-45 ms host will survive just fine, and indeed will show far less jitter than +-45 ms (look at the local clock; high frequency noise is damped out). It is just that NTP servers are expected to be a lot closer than 45 ms, so your host looks bad compared to NTP's expectations (but not needs).

Further, the stuff about synchronization loops is irrelevant. The NTP protocol survives loops just fine; it's just that the consequence of a loop is that the machines involved count their stratum to infinity and disconnect from the synchronization subnet rather than continuing to fool themselves that their servers know something they don't. This is quite reasonable behaviour since NTP clocks don't drift much when left unsynchronized.

I would suggest to you that NTP knows far more about Murphy than DTS does at this point, since it has been tested in far more environments, on far more machines, in likely far harsher environments, through far more revisions, for far longer than DTS has. There are three independently done implementations of NTP, all of which work well and interoperate. How complex can NTP be? NTP can be given a random set of servers and work just fine, thanks. This wasn't done with auto-configuration specifically in mind, but rather simply to meet robustness requirements. NTP has a lot of real world experience to prove that it is robust, and that it is certainly robust in the face of even gross misconfiguration. What more is needed for an auto-configurable protocol? Autoconfiguration certainly couldn't do a worse job than people do.
As for Byzantine failures, you are right that NTP's scheme for leap second notifications suffers from this, but what is the worst thing that can happen if this occurs? Right, your clock ends up a second off. With DTS, however, it is a virtual certainty that the clock will end up a second off when a leap second occurs, so criticizing NTP for leaving this hole is a little like the pot calling the kettle black. That DTS increases its inaccuracy by a second is irrelevant for comparison, since NTP doesn't maintain this inaccuracy. If (when) NTP maintains an inaccuracy interval it should probably increase it by a second during leaps as well. This doesn't help keep your clock accurate across leaps, though.

The statement that 'NTP should increase its inaccuracy' implies that DTSS is doing the right thing. Am I confused?

For broadcast time, however, I think this is incorrect. The NTP clock selection code includes an agreement protocol, and this is still used for broadcast time. To cause a failure one would have to co-opt a majority of the servers, and this is hardly less robust than DTS. I can see little more exposure to such failures with broadcast time than with polled time, and I think we must agree that polled NTP is not less robust in the face of Byzantine failures than DTS. Further, if you are really worried about hostile attacks on your clients then you'll be using authentication anyway, in which case there is no additional exposure to such failures. More than this, note that NTP's multicast time is used in the LAN environment. The transit delays here are a few milliseconds, and indeed my daemon includes partially implemented code to determine these delays on the fly by polling. This transit delay isn't even measurable by a lot of machines (between machines with 10 or 20 ms clocks you end up computing absurdities like negative round trip delays. Does DTS handle this?), so DTS isn't necessarily going to know a whole lot about this delay by polling anyway. Now, you are willing to accept a 45+-45 ms error in the setting of your clock and a 90 ms inaccuracy, for perfect data in, due to the primitive local clock processing that DTS does, so why the heck not add in another, say, 50 ms or something equally outrageous, to the inaccuracy interval for the broadcast and forget about it? The chance of it ever exceeding 50 ms (or whatever) is of about the same order as, say, lost clock interrupts or a hardware failure ruining your assumptions about the maximum local oscillator drift. Call the big delay a network "fault" and forget about it.

How do you account for changes in the bridged LAN and/or remote bridges? Most LANs are bridged, and a change in topology will affect the network delay. Furthermore, if one has a remote bridge, say at 56 kbps, then for each minimum size packet (64 bytes) another 5 msec of ONEWAY delay (i.e., asymmetrical) will be introduced. We have measured ONEWAY delays on LANs (with only local bridges) as high as 100 msec. As discussed in previous comments I do not agree that NTP will work the same as DTSS. As for the local clock, I do not believe that it is technically sound to 'train' an uncompensated clock (see the discussion on training at the beginning of this memo).

I am sorry for the tone of this, but I can't help but take issue with the (apparent in DTS' design) attitude that nothing in NTP was worth looking at (from my perspective DTS' treatment of the local clock is horrendously primitive and simple-minded, for example).
I understand from your last note, however, that you are maybe growing more sensitive to this. If we are to standardize a time protocol, let us make it a good one by not ignoring existing experience.

To be frank, I would say that NTP's view that an uncompensated oscillator can be 'trained' to one part in 10**8 has no foundation in the scientific literature.

REFERENCES

1. Mills, D. L., "Network Time Protocol (Version 2) Specification and Implementation", RFC 1119, University of Delaware, September 1989.
2. Vig, J. R., "Quartz Crystal Resonators & Oscillators For Frequency Control and Timing Applications", SLCET-TR-88-1, US Army Electronics Technology and Devices Laboratory, Fort Monmouth, New Jersey, January 1988.
3. Bottom, V. E., "Introduction to Quartz Crystal Unit Design", Van Nostrand Reinhold Electrical/Computer Science and Engineering Series, New York, 1982.
4. Frerking, M. E., "Crystal Oscillator Design and Temperature Compensation", Van Nostrand Reinhold Company/Litton Educational Publishing, 1978.
5. NIST, "Time and Frequency Seminar - June 14-16, 1988", Time and Frequency Division, NIST, Boulder, Colorado.
6. NIST, "Time and Frequency: Theory and Fundamentals", NBS Monograph 140, SD Catalog No. C13.44:140, Boulder, Colorado.
7. VECTRON, "Crystal Oscillators 1989", VECTRON Laboratories, Inc., Norwalk, Connecticut.
8. Imae, M., et al., "A dual frequency GPS receiver measuring ionospheric effects without code demodulation and its application to time comparisons", Proceedings of the 20th Annual Precise Time and Time Interval (PTTI) Applications and Planning Meeting, Vienna, Virginia, 1988.
9. CCIR, "Recommendations and Reports of the CCIR, 1986 - Standard Frequencies and Time Signals", XVIth Plenary, Dubrovnik, 1986.
10. Ogata, K., "Modern Control Engineering", Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1970.
11. NIST, "Time & Frequency Bulletin No. 388, March 1990", NISTR 90-3940-3 (a monthly report from NIST), Time and Frequency Division, NIST, Boulder, Colorado.

------------------------------------------------------------------------

Date: Tue, 27 Mar 90 4:59:06 GMT
From: Mills@udel.edu
To: comuzzi@took.enet.dec.com
Cc: mills@udel.edu, dennis@gw.ccie.utoronto.ca
Subject: Re: More discussion of the differences (and similarities) of DTS and NTP.

I've sent this to everyone else; yours bounced because of a typo.

... This is a note to continue the DTS/NTP comparison, because I too am finding this conversation fruitful. Dennis, I would be glad to mail you a copy of the DTS architecture if you want one. (And Dave, I've even changed the cover and introduction.) ...

Yeah, I've learned something, too. Are you game to cast the document as an RFC? It's too bad we didn't schmooze while the stove was still simmering the kettle. Is the kettle still warm?

... Allow me to start with the decision of DTS not to support a multicast mode. One reason was that protocols which multicast the time will be subject to Byzantine failures. ...

There are two issues here, one pragmatic and the other hidden. Since NTP multicast clients may enjoy multiple NTP multicast servers on the same wire, Byzantine vulnerabilities are reduced. The only thing you lose is the synchronization delay (aka inaccuracy interval - darn, we should have both called that the "confidence interval"), which in the Unix community is imperceptible. The hidden agenda is to explore the utility of the new IP multicast capability, which is quickly becoming ubiquitous in the Internet R&D community.
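The sort of agreement that keeps a single bad multicast server from capturing its clients might look like the following toy rule (a sketch only; NTP's actual selection and clustering machinery is considerably more elaborate):

    import statistics

    def agreed_offset(offsets):
        """offsets: this host's apparent offset against each multicast server."""
        if not offsets:
            raise ValueError("no multicast servers heard")
        # the median is unaffected unless a majority of servers is co-opted
        return statistics.median(offsets)

    # two honest servers on the wire and one falseticker two seconds off:
    print(agreed_offset([0.003, 0.005, 2.100]))   # -> 0.005

With several multicast servers on the same wire, a failure requires co-opting a majority of them, which is the point made above.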
While I would like to make the case that multicasting should be supported on its own merits, I have great nervousness about such scenarios as CMU, which, as you probably know, runs what could be called a godzilla network of intertwined wires and bridges. I am told that each of several NTP servers there now harks to upwards of 500 clients. If each one of those client dudes expects to rattle the chimes of, say, three servers, the induced RF field might misdirect planes 50 miles away.

... The second point, the one I was trying to raise in my reply to Dave, was that it was DTS's intention to have the time and interval information available on every node.

The inaccuracy interval is available in NTP, too, but a Unix interface is not. While not specified in the NTP spec, a clamp should be placed on the frequency compensation term, like +-20 ppm or something like that. It would then be possible to make almost the same confidence statements about NTP as for DTS. The "almost" is because of the basic difference between the NTP selection/combining algorithm and the DTS intersection algorithm. These issues need to be discussed at another time.

... The third point I'd like to discuss refers to Dave's statement about a "considerable Internet constituency which has noisily articulated the need for a multicast function when the number of clients on the wire climbs to the hundreds." Is it that they wanted multicast? Or was the real objection the practical difficulty of adding a second server to a LAN of 300 nodes and then having to change the server entry in 150 ntp.conf files to redistribute the load? ...

The quote is quite correct. A number of dudes jumped on me to include multicasting in the spec. Their perception is more concerned with network load than with correctness; however, I readily admit they might not have yet become sensitive to the configuration issues you raise. However, I continue to believe the autoconfiguration issue transcends NTP and should be considered in a wider context.

... The next topic I'll discuss is the area of the DTS architecture I personally believe is least likely to survive the standardization process unchanged: the timestamp format.

I think we have much agreement here. I still have a few scars left from old Internet wars on this point, including the use of binary versus character-oriented formats and whether leap seconds are meaningful. The fact that leap seconds cannot be reliably predicted seems to be a showstopper. As long as you must have institutional memory for them, it may be easiest to include epoch era information using the same mechanism.

... There has been a fair amount of discussion about interoperation of the two time services. Let me try to clarify what I said (too tersely) in my original response to Dave's paper. There are three separate cases I'd like to distinguish: A) An isolated group of DTS systems which obtain time from NTP. B) An isolated group of NTP systems which obtain time from DTS. C) A collection of DTS and NTP systems giving time to each other. I believe that without fairly major changes in one or both of the architectures, case C is a problem; however, it is an easily prevented problem (see below). Cases A and B, however, are very interesting, useful and I claim easily achievable.

I see no problem with A and B either, even to the point of equating synchronization distance to inaccuracy interval.
NTP would have to assign a stratum number to a DTS client and DTS might want to mark the synchronization distance provided by NTP as a "possibly unreliable inaccuracy interval." You have enough bits in the DTS timestamp to even do that, as well as include leap warnings. I would even suggest amending the NTP spec to include such an interface specification and the DTS spec to include an identifier for the primary reference source. For this reason and in order to suppress neighbor loops, NTP includes the synchronization source in the header.

... Case C potentially breaks the invariants of both protocols. The DTS invariant is that UTC is contained within the interval. The NTP invariant (I'm less sure of my statement here) is that the frequency of good servers agrees with UTC.

In fact, the intent is to phase-lock all the clocks to UTC, which means both in frequency and time. This is not so much an invariant as a goal, although in practice it is achieved in much the same fashion as the power grid keeps your electric clocks humming UTC. Yeah, I tried that too and investigated whether the power grid itself makes a usable time/frequency transfer medium. Even though the guys in Columbus run the eastern-divide grid from a WWVB clock, local utilities drop off the grid from time to time and do not feel it necessary to maintain phase continuity, but that's another story among many other hilarities to be shared at another time.

... NTP has a further invariant, that there are no loops in the time distribution network. This is enforced by the stratum. Clearly if DTS took time from a collection of NTP servers and later gave it back to the same collection of servers, a loop could and probably would occur. There is a simple method to prevent this: I propose that the gateway described in case B above always declare itself to be at some fairly high (numerically large) stratum. Potential clients will ignore the DTS/NTP server in favor of servers which obtained their time exclusively via NTP (and have much lower stratum numbers). I'm assuming that the NTP implementation at the gateway can be coerced into using a fixed stratum and would propose a value of 16 for this purpose.

This is a useful approach and requires only minor mods to the NTP spec. However, please don't underestimate the importance of the stratum, which is useful to avoid instabilities such as spurious clock-hopping and loops.

... There's also a stratum zero which is supposed to be used when the stratum is unknown; however, I'm not sure I understand what value servers which obtain their time from a stratum zero server will use. Do they use zero? If so, how are loops prevented amongst themselves? ...

Stratum zero means "undefined," which can mean a lot of things, usually that the client has not yet synchronized. NTP peers will not synchronize to a stratum-zero peer, but will run the protocol so the peer can get synchronized.

... A few nits before we get to the meat of the discussion. Dennis is concerned that the drift has to be input as a management parameter. I'll show my vendor colors here and say this isn't a management parameter but an implementation detail.

Dennis is not talking about the max frequency offset (I have problems with "drift," since in the communications field this means a change in frequency with time), but about the estimated frequency offset produced by the local-clock algorithm, which is usually much lower.
It can take an NTP peer some days to fine-tune the frequency, compliance and whatnot to produce stabilities in the .01-ppm range, so implementations can reduce the time to converge by remembering the offset and recovering it on reboot. The problem I have with arbitrary maximums is that they are quartz-centric. You can certainly stamp the nameplate with the oscillator tolerance and expect it to be maintained throughout the service life of the equipment, but I sure wouldn't want to rely on the nameplate in our shop where machines are gloriously cannibalized and CPU boards routinely swapped. You could, of course, stash info like this in lithium along with the Ether address and license serial, but I doubt too many would take it seriously. Nevertheless, the same thing you say about DTS applies to NTP, assuming the clamp I mentioned previously is added to the frequency compensation. The nameplate would then specify the sum of the quartz tolerance plus the clamp.

... The second nit has to do with DTS's treatment of leap seconds. I appear not to have been clear here. Dave's original document was basically correct in its description of how DTS handles leap seconds: servers increase their inaccuracy at the month boundary and a time provider narrows the interval later. When I wrote "Each server has to maintain and propagate this state before the leap insertion. This is, of course, subject to Byzantine failures. A failing server can insert a bad notification." I was describing my understanding of (and a problem with) NTP's leap second handling. If my understanding of NTP is incorrect, I apologize, but the Byzantine problem seems real to me.

I still am unsure how to incorporate a leap second into a prolific DTS subnet, if I can use the term. I assume all the DTS gizmos in the world will ramp up their inaccuracy intervals by one second at the end of every month. Let's say a leap second occurs and is eventually recognized by most of the radios (all extant radios sail right through a leap, only to lose synchronization later and then recover it). This takes a few minutes to a few hours. Servers and clerks will discover this fact from two to fifteen minutes later. While it is true that the inaccuracy interval will be correctly maintained, it may come as a shock to some users that for some uncertain period their timestamps suddenly took a one-second hit in accuracy. Assuming the time providers are told, either by radio or keyboard, that a leap is nigh, it would be possible to remove this ambiguity by stealing a bit in the timestamp format.

... Dave asked if I am in substantial agreement with the statistical models presented in his first document. I agree with most of this section. My only significant disagreement is with the last paragraph. It is true that DTS assumes that a system's clock drifts at a rate less than its manufacturer's specification, and that a hardware time provider operates within specification. The probabilities of these assumptions being false are on the order of magnitude of other hardware failures. Software implementations do not routinely checksum memory any more (and they certainly don't do it to find memory errors). Violations of these assumptions represent faults, just as real as processor faults, and should be fixed. Note the long tails you observe in the distributions in the Internet are on message transmission times and the like. These parameters are dynamically measured in the DTS algorithm.
In his discussion of this point for the OSF, Wick Nichols stated: "DTS is willing to accept historical estimates of the probability that a clock will go faulty (with checks for faultiness), but is not willing to accept historical estimates of current network characteristics."

I'm struggling for a way to state my position in the most compact way I can, while still being fair to both the DTS and NTP models. I think we both agree that there is a tradeoff between accuracy and correctness. There will always be some tails in the error distributions for time providers, servers and network paths, as we have amply demonstrated over the years. Dennis' comments about missed clock interrupts are on the mark, as are the pragmatic mysteries of reading clocks in real operating systems. Your example of log coordination is a good one, as I have been using NTP for several years doing just that. However, in my battles with the old NSFNET Phase-I backbone, it was essential that transcontinental events (on the backbone) could be tagged accurately to within ten milliseconds or so. Even today I expect NTP to correctly tag the Norwegian atomic clock to within half a second, in spite of gross misconduct across the Atlantic, as you may have seen. I thus see neither the statistical approach of NTP nor the correctness approach of DTS as necessarily "right," just different.

I did a little experiment you might enjoy. Using the simulator mentioned previously and first the NTP and then the DTS selection algorithms, I purposely wiggled the offset of one of three clocks from nominal to a couple of seconds off, the idea being to create a falseticker in gradually increasing steps. The inaccuracy interval in both cases was calculated as in DTS, but without a time dependence, since the intervals between updates are small and the residual frequency offset is very small. While I hardly have enough data to make a definitive judgement, I can say that NTP quickly tossed out the bad clock, while DTS hung on for dear life rather longer than I would expect. I intend to play with this some more.

My experiment pointed out a possibly noxious issue. I was at pains to make sure the inaccuracy interval was computed correctly, starting from the primary reference clocks that were in fact peers of the Norwegian chimer. However, the path is quite noisy, with the effect that the customer receiving the time-inaccuracy stamps can get wildly differing inaccuracy intervals on successive samples. It would seem the DTS customer would have to accumulate a number of samples if only to make sure the inaccuracy interval was reliable. In principle, this is the same strategy you suggest for time providers. I don't see anything necessarily wrong with this, but it does demonstrate that escape from probabilistic mechanics probably contradicts the third law of thermodynamics.

... Dennis asked a question about DTS authentication in the Internet environment. What I personally would like to see is an implementation of DTS using Apollo's NCS which in turn used Kerberos authentication. This is basically what Digital has proposed to the OSF in response to their distributed computing request for technology.

My friends the electric spooks tell me Kerberos has real conceptual problems and that we should salute SNDS instead. Believe it when they tell us how to implement KMP and Firefly.
Be advised it takes more than 100 ms to calculate the NTP cryptosum in an LSI-11/73 (yeah, I know I deserve that), and this cannot be compensated unless the protocol can measure and adjust the timestamps accordingly (my ISO friends are much aggravated by that position).

...

Now to the major contention of Dennis's review, that accuracy and ease-of-management are "completely and utterly orthogonal". I disagree with this less than a reader of my response to Dave might think, though I am somewhat in disagreement with it. What I hold is that ease-of-management, provability and accuracy for a time service are all interrelated. Those guys who actually do mount and run large NTP subnets (Merit runs 150 chimers in the NSFNET backbone alone, most of which have identical configuration files) can speak eloquently about their own hardships. That's not to say I don't believe you, just that others should make the NTP case.

...

The problem with just adding the NTP local clock model to DTS (as I understand the NTP local clock model) is that the resulting system could have wild instabilities. (Maybe my understanding of NTP is incorrect here.) The dynamic nature of the DTS autoconfiguration rules (couriers choosing a random member of the global set, for instance) means that the input time driving the local clock model will have what Dave calls "phase noise". As I understand NTP's local clock model, this is where the instability creeps in.

It's not so much the phase noise as it is the dynamics of the local clock loop itself, sort of like requiring tickadj to have a confined range of adjustment (a sketch of this appears below). The rate at which the loop corrects for time and frequency errors is fundamental to its stability; otherwise, it could surge in much the same fashion as if you tried to drive a car with a half-second delay between the steering wheel and the steered wheels. I believe, as I hope is demonstrated in NTP implementations, that appropriate parameters can be specified and engineered for any implementation, either DTS or NTP, in much the same way that tickadj is engineered now, even on an optional (configured) basis. I envision a local-clock implementation appropriate for either NTP or DTS, or TSP for that matter, by selecting either model with engineered parameters determined only on the basis of whether you have a line-frequency oscillator, an uncompensated quartz oscillator or a GPS receiver or atomic clock. This is in fact the fuzzball implementation.

...

Further, the existing NTP protocol avoids loops by using a stratum concept, while the DTS autoconfiguration happily produces loops. As I noted previously this doesn't affect the DTS algorithm, but loops would cause havoc for NTP. Again, one could add complexity to the DTS algorithm to prevent the loops, but I claim one would pay a price in system management cost.

I perceive the DTS model does not consider more than three "strata" (global server, local server, clerk) necessary in a DTS subnet, right? If this can be assured, NTP is in fact needlessly complex. However, we are not building NTP subnets this way and have found it necessary to use a richer hierarchy requiring more strata, even if some LANs have their own time providers. One reason for this is a notorious distrust of time providers, so all NTP primary servers also chime with other servers (usually at least three), not necessarily primary servers. Also, even in this university, which is hardly at a loss for time providers (!), we have many stratum-4 and probably stratum-5 chimers even now.

...
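Returning to the tickadj aside above, here is a minimal sketch of the kind of bounded slewing I mean. The constants are illustrative, not from any particular kernel or spec:

    /* Sketch of adjtime()-style slewing: a requested offset correction
     * is amortized by skewing each clock tick by a bounded amount, so
     * the clock never steps and its rate error stays within a known
     * clamp.  All numbers are examples. */
    #define TICK    10000L   /* nominal microseconds per clock tick   */
    #define TICKADJ     5L   /* maximum microseconds of skew per tick */

    static long adj_remaining;   /* microseconds of correction still owed */

    void request_adjust(long offset_us)
    {
        adj_remaining = offset_us;
    }

    long next_tick_length(void)  /* called from the clock interrupt */
    {
        long skew;

        if (adj_remaining > TICKADJ)
            skew = TICKADJ;
        else if (adj_remaining < -TICKADJ)
            skew = -TICKADJ;
        else
            skew = adj_remaining;
        adj_remaining -= skew;
        return TICK + skew;  /* rate error bounded by TICKADJ/TICK = 500 ppm */
    }

The clamp is what keeps the steering-delay surge bounded: however wild the input offsets, the local clock's rate can never be wrong by more than the engineered limit.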
Another problem (according to Dave) is that the resultant phase-locked loops have to be analyzed in the light of assumed probability distributions, etc., and one does not end up with the sort of proofs of correctness that are favored in DTS.

There is one interesting aside on this last point. I believe there is a way one could add clock training to the DTS model and preserve the correctness. If the training algorithm decides to change the rate of the clock by some amount, *increase* the maximum drift rate by that amount. I believe this can be shown provably correct by the techniques in Marzullo's thesis. However, while this improves the precision of the time (the intersystem phase differences and rate differences will be smaller), the inaccuracy (the guarantee given to the user) will be worse! That DTS has chosen not to do this is, of course, the basic philosophical difference about what's important showing up again. However, the existence of at least one method to incorporate clock training into a provable system gives hope that both camps can be satisfied and in particular that the large body of work on the NTP local clock model can be incorporated.

I am not (yet) expert enough in the NTP local clock model to see my way through this. While it may be that the inaccuracy interval provided to users may degrade slightly if frequency compensation is embraced, the inaccuracy jiggles certainly can't be as bad as the rock-n'-roll I see with the simulator and Norway data. I sense there is room to jostle on this issue.

...

The other obvious possibility is to just add the autoconfiguration to NTP. To an extent, this is occurring. The multicast functionality clearly addresses the ease-of-management issue. However, for NTP servers, I claim that choosing the right server is important enough that it can't be left to an arbitrary algorithm. Switching at random between servers reintroduces the clock-hopping problem (the extra phase noise produced by the clock-hopping will cause problems for NTP). One could attempt to just pick a set of servers at random and stick with them for some long time to reduce clock hopping, but that will produce serious sub-optimality in the case of a changing network configuration (the particular servers being synchronized with might become cut off from their good time sources, or the paths to them might involve links which become overloaded, and this wouldn't be discovered for a long time).

The reason for the NTP selection algorithm is to find the "best" clocks from a population possibly including "poor" ones on the basis of estimated accuracy, stability and so forth. The selection and weighting factors are dynamically determined using what I hope are sound statistical principles, and the job is considered so important that it is entrusted only to a purpose-designed algorithm, never mind a generic autoconfiguration scheme. I think what you have in mind are discovery and configuration issues, which certainly could be improved were DTS algorithms to be mimicked. Do you hear a faint echo of the old Xerox Clearinghouse in the far distance?

------------------------------------------------------------------------

Date: Fri, 30 Mar 90 08:30:48 PST
From: comuzzi@took.enet.dec.com
To: mail11: ;, "@dts-ntp.dis"
Subject: More discussion of NTP and DTS

Dennis,

Sorry for the delay in responding; I wanted to review where we are and try to summarize our positions. It happens that I'm not an expert in control theory. Mike Soha, who is a student of that subject, has already responded to your discussion.
I look forward to a continued lively exchange.

Dave,

You've asked this question twice (whether DTS will appear as an Internet RFC) and it deserves an answer. It turns out that about three or four months ago Ross Callon tried to elicit interest in DTS in the Internet Engineering Steering Group (of which he is a member). Ross wanted to submit an RFC, but the response was not encouraging; basically he was told nobody wanted to think about time services. I agree with your observation that it would have been better to have these conversations earlier. I'm assuming the IESG's lack of interest will change if OSF selects DTS as one of its Distributed Computing Environment technologies. If that happens, however, the architecture specification will probably be tweaked to reflect other OSF technology selections, such as which nameservice, RPC, etc., but we will submit an RFC. Even if DTS is not selected, I believe it would still be a good idea for DTS to appear as an RFC (I suspect the relevant powers would entertain a DTS RFC if you supported it).

Now continuing the discussion, Dave's observation that DTS is not dead set against clock training is in fact correct. Clock training can be viewed as orthogonal to the DTS architecture, though of course any training would have to be done in a way which preserved the Marzullo proofs (a sketch of the interval-intersection idea behind those proofs appears at the end of this message). Mike describes in his note a simple proposal he has for incorporating clock training. The question I have personally been struggling with, and haven't had much success understanding, is: in the future, if DTS incorporated the NTP local clock model, how much of the rest of NTP would have to come along with it? In particular, would the various fields in the time response message (e.g., synchronizing dispersion) be required? (It seems that these have more to do with the clock selection algorithm.) Are strata required, or is the loop-breaking mechanism of inaccuracy (synchronization distance) sufficient? Can the local clock model be proven to be stable for all inputs? Earlier, I got the impression that this could only be done by making rather strong assumptions about the network distribution functions, etc. Now I'm not at all sure of that, based on recent statements from Dave and Dennis. I hope this will fall out of the continuing 'control theory' discussion.

Dave correctly observes that the DTS architecture only has a three-level hierarchy. The DTS architecture has been extended (as part of the OSF submission) to permit the specification of a local set which is not autoconfigured. Basically, the local set is enumerated in a nameservice directory just like the global set. This permits construction of a hierarchy in which one level's global set is another level's local set. Obviously this is only autoconfiguring at the leaves, but it does permit construction of as complex a hierarchy as one would desire. We are *NOT* going to tell customers to do this, and the vast majority of customers will be quite content never thinking about strata or multi-level hierarchies.

The real difference of opinion here is whether the majority of extended LANs will have their own time providers. DTS assumes that TPs are becoming commonplace. In that case, one only does WAN transactions as a check for inter-XLAN time differences, and to support small (TP-less) LANs. For this environment, the short hierarchy suffices. Thinking about the "autoconfiguration vs. accuracy" issue in light of the different TP availability assumptions, I believe I understand the disagreement.
(I think Dennis had figured this out too, I just hadn't read what he was saying!) DTS derives much of its ease-of-management because it only thinks in terms of a three-level hierarchy, it autoconfigures the leaves, and it uses a single global set stored in the namespace. If NTP were used in a similar manner, then NTP could be made equally autoconfiguring. I agree. The point I was trying to push is that when you create a new server for NTP, you have to figure out where to put it in the grand hierarchy, and further you have to select peers for it that will provide reasonable-quality time. Now if you don't have an NTP hierarchy, then my argument is specious -- the same strategy that works for DTS would work for NTP.

To summarize (I believe) Dennis' position: NTP can be made just as autoconfiguring as DTS. Indeed it already has some of the autoconfiguration due to its multicast mode; it just needs the discovery mechanism. The result would be more accurate than DTS, due to clock training. My position would reduce to: DTS is a simpler protocol which already has the autoconfiguration and could be made more accurate by adding some clock training. These positions are not that far apart. There is still a remaining disagreement: how much clock training or other complexity to add, and what the optimal amount of complexity is from a cost-benefit analysis.

Now, as I understand Dave on the autoconfiguration issue, his position is: TPs are not (and will not become in the near future) that commonplace, so more than a three-level hierarchy is required. Hence only the leaves can be made autoconfiguring and there will always be some residual manual configuration. (I'm looking for an agreement on what our various positions are here; I am willing to accept that we disagree for now.)

Considering the larger question of "manageability vs accuracy", there is still a lot of complexity in NTP which manifests itself as more management complexity. I observe there are at least three parameters (associated with the filter and selection algorithms) whose description includes the sentence "While the value of ... is suggested, the value may be changed to suit local conditions on particular peer paths" (or similar text). This seems to me a rather frank admission of more management (as opposed to more work for implementors). We seem to have no agreement (due to our different goals of how much accuracy to deliver in the first place) on these trade-offs.

I believe there are two issues where we have reached complete agreement. First, both Dave and Dennis seem to be in agreement that NTP can be modified to include a provable inaccuracy bound, equivalent to DTS's inaccuracy. (I'll stick to the DTS jargon because I'm more comfortable with it. Dave's term "confidence interval" is a reasonable name.) This interval could be provided to the end user. I believe this also. I hope that NTP will continue to evolve in this direction, independent of the OSF decision. Second, I believe we are in agreement on how (adequate) interoperation between the protocols could be achieved. I'll sign up to discuss this in the DTS RFC.

I am curious about Dave's experiments with the DTS algorithm. Dave, can you provide more details (possibly out-of-band to this discussion)? This note is already longer than I intended, so I'll send it out. I'll keep the discussion of layering a time service over an authenticator (that Bill raised) for later.
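Since the Marzullo proofs have come up several times, a compact sketch of the interval-intersection idea at their heart may help. This is my own rendering of the published algorithm, with tie-breaking and the fault-tolerance refinements elided, not code from our implementation:

    #include <stdlib.h>

    /* Each clock asserts that correct time lies in [lo, hi]; we scan
     * the sorted interval endpoints for the region covered by the most
     * intervals.  Clocks whose intervals miss that region are treated
     * as falsetickers. */
    struct edge { double t; int delta; };   /* +1 opens, -1 closes */

    static int cmp(const void *a, const void *b)
    {
        double ta = ((const struct edge *)a)->t;
        double tb = ((const struct edge *)b)->t;
        return (ta > tb) - (ta < tb);
    }

    /* Returns the best overlap count; *best_lo and *best_hi bound the
     * first region achieving it.  Assumes n >= 1. */
    int intersect(const double *lo, const double *hi, int n,
                  double *best_lo, double *best_hi)
    {
        struct edge *e = malloc(2 * n * sizeof *e);
        int i, count = 0, best = 0;

        for (i = 0; i < n; i++) {
            e[2*i].t   = lo[i]; e[2*i].delta   = +1;
            e[2*i+1].t = hi[i]; e[2*i+1].delta = -1;
        }
        qsort(e, 2 * n, sizeof *e, cmp);
        for (i = 0; i < 2 * n - 1; i++) {
            count += e[i].delta;
            if (count > best) {
                best = count;
                *best_lo = e[i].t;      /* region with this coverage */
                *best_hi = e[i + 1].t;  /* extends to the next edge  */
            }
        }
        free(e);
        return best;
    }

The correctness claim is that if at most n - best of the clocks are faulty, true time lies in [*best_lo, *best_hi]; clock training preserves this so long as any rate adjustment is reflected in a correspondingly inflated drift bound, as proposed above.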
------------------------------------------------------------------------

Date: Mon, 2 Apr 90 10:00:22 PDT
From: Michael Soha LKG1-2/A19 226-7658
To: dennis@gw.ccie.utoronto.ca
Cc: decpa::"Mills@udel.edu", decpa::"elb@mwunix.mitre.org", decpa::"marcus@osf.org", soha
Subject: My response to Dennis Ferguson's NTP vs DTSS memo - retry

Dennis,

When I first read this memo, I discounted it as part of an emotional discussion of why NTP is the right choice. However, upon reflection I felt it necessary to respond to the many issues alluded to in the memo. The problem I have with this memo is twofold: there are several statements that are simply incorrect, and some of the deductive reasoning is questionable. I will apologize beforehand for my bluntness in this memo; however, I find it necessary due to the number of misconceptions in areas such as UTC, quartz oscillator theory, and time transfer techniques. As a general comment, I find this discussion, NTP vs DTSS, becoming a very emotional debate. In an effort to move this discussion to a more technical plane I suggest that people specifically note their references; I have attached mine at the end of this response.

TRAINING

I'd like to begin with a discussion of one of my major concerns with NTP: training of clocks (and the local clock model). Looking at Dave's NTP specification, Ref 1, page 36, I read: "The Fuzzball logical-clock model, which is shown in Figure 3, can be represented as an adaptive-parameter, first-order phase lock loop, which continuously adjusts the clock phase and frequency to compensate for its intrinsic jitter, wander, and drift." And the last two sentences of this section (pp. 37-38) read: "The effect is to provide a broad capture range exceeding four seconds per day, yet the capability to resolve oscillator drift well below a msec per day. These characteristics are appropriate for typical crystal controlled oscillators with or without temperature compensation or oven control."

Given these statements I conclude: 1. that the clock model attempts to remove both short-term (environmental) and long-term drift components (aging and initial offset); and 2. that for an uncompensated crystal oscillator one can attain a stability on the order of one part in 10**8 (i.e., 1 msec/day). I disagree with both conclusions, and that is why I cannot accept the NTP clock model.

The influences on oscillator frequency [Refs 2-6] include:

   TIME - short term (noise), long term aging
   TEMPERATURE - static freq vs temp, thermal history (hysteresis)
   ACCELERATION - vibration, shock, acoustic noise
   IONIZING RADIATION - steady state, pulse
   OTHER - initial offset, power supply voltage, humidity, load impedance

The above list has a number of contributors that are of more interest to DOD than the computer field. Reviewing a crystal oscillator catalog [Ref 6], one sees the main contributors being: time, temperature, initial offset and power supply voltage. Now the question becomes: what are the relative magnitudes of these drift contributors? Let's look at an uncompensated (i.e., without temperature compensation) 5 MHz CMOS clock oscillator, VECTRON part # CO-416B option 5. The numbers are:

   Accuracy at 25 degrees C:  +/- 10 ppm
   0 to 50 degrees C:         +/- 5 ppm
   Aging:                     3 ppm/year first year, 2 ppm/year thereafter
   Supply voltage:            little impact, on the order of 10**-7 per % change

Now, assuming one has no control over the environment (i.e., it may be sitting in my office or tucked away in a closet), the noise floor is on the order of +/- 5 ppm (temperature stability).
This may seem a little extreme, but you have to realize that a protocol designer has little control over the internal temperature gradients of a computer system. Given the background noise of +/- 5 ppm, one cannot measure the error associated with aging of 3 ppm/year (approx 10**-8/day). The best one can expect to do is to get close to +/- 5 ppm. The bottom line is that one cannot train an oscillator in an uncompensated environment because of the noise (i.e., instability due to the environment). Given this, how can the NTP clock model achieve a stability of one part in 10**8 (i.e., less than a msec/day) for an uncompensated oscillator? My conversations with other people in this field indicate that it is simply not possible.

Now why did we pick a number like one part in 10**4? Well, assuming a lifetime of 10 years, the stability of this oscillator would be (at the end of 10 years) about 36 ppm, or 3.6 parts in 10**5 (5 ppm for temperature, 21 ppm for aging, 10 ppm accuracy). We felt that one part in 10**4 would be true for most of the VAX crystal oscillators.

It is my belief that one may be able to account for the error associated with the initial offset (i.e., the actual milling and polishing of the crystal). In the case of DTSS, one may be able to improve the clock stability from one part in 10**4 to about one part in 10**5. I would simply measure the error of the clock over a day (approx 10**5 seconds). Assuming I knew UTC to within 100 msec, I'd have enough significant data to calculate the oscillator drift to about one part in 10**5 (10**5 seconds/100 msec = 10**6). To do this one would need a S/W clock that is not adjusted; it must be able to accumulate the error over a day. Once the new tick value is determined, one could update the timer interrupt routine. Note that this correction is orthogonal to DTSS operation and need not be done that frequently; the 10**5 value would be correct for at least a year, since the stability after one year would be +/- 8 ppm (5 ppm for temperature and 3 ppm for aging).

------------------------------------------------------------------------

Date: Tue, 3 Apr 90 5:12:09 GMT
From: Mills@udel.edu
To: Michael Soha LKG1-2/A19 226-7658
Cc: dennis@gw.ccie.utoronto.ca, comuzzi@took.dec.com, mills@udel.edu
Subject: Re: My response to Dennis Ferguson's NTP vs DTSS memo - retry

This message is in response to your reply to Dennis Ferguson's recent message about the NTP local-clock model and its implications. I would like to thank you and others at DEC for the time and care you have put into the recent message exchanges. I think we have all learned useful things that might be applied to ongoing and future projects. However, I want to make clear that my interest in pursuing this discussion is not to establish which of DTS or NTP is "better," but what can be learned to improve them or a future enhanced protocol. I realize the importance to DEC's agenda of capturing the standards process and have no personal interest in competing with this or obstructing it. Based on experience, however, I do want very much to promote that, whatever standard is adopted inside or outside the Internet community, the performance objectives attributed to NTP are at least potentially attainable, either in the emerging protocol stack or enhancements of it. I suspect Dennis might want to produce his own reply; however, I will respond to the technical points you raise. I am not including the text of either yours or Dennis' original message, since that might increase the bulk to unbearable levels.
What Dennis has called "clock training" and I have called "frequency compensation" was introduced several years ago in the local-clock model adopted in NTP. The primary reason for doing this is to eliminate the need to precalibrate the inherent frequency offset of the reference oscillator and to serve as a digital filter to reduce the timing errors. In fact, the model was introduced prior to NTP and has evolved over several generations of time-transfer systems since 1979. As you know, it is described as an adaptive-parameter, first-order, type-II phase-locked loop (PLL), which is analyzed in many books, including those cited by each of us. While a type-I PLL is unconditionally stable, this type of loop cannot remove all timing errors, since it cannot compensate for frequency errors. A type-II PLL can do this, but this type of loop can become unstable unless it is engineered according to established principles. The NTP PLL has been rigorously analyzed, designed, simulated and implemented according to these principles (a toy sketch of such a loop appears below). The cost for this is additional architectural constants and tighter tolerances to maintain overall stability.

The constants called out in the NTP spec were arrived at after substantial analysis, simulation and experiment using Internet paths ranging from high-speed LANs to those spanning the globe. A detailed mathematical analysis can be found in Appendix F of the February 1990 revision of the NTP spec, which has not yet appeared as an RFC, but can be FTPed from louie.udel.edu as the PostScript file pub/ntp/ntp.ps, or I can mail you a paper copy if you wish. By the way, the local-clock algorithm described in the existing spec, Section 5, has minor errors in a couple of places, including some of the recurrence equations. This section was completely rewritten in the revised spec and several new appendices added. I am currently working on another appendix on error analysis.

An important principle in the design of the local-clock algorithms was that the protocol itself should not limit the possible application to precision time and frequency transfer and that it be scalable to very high speeds, hopefully beyond a gigabit. Surely there are no hosts today that can achieve anything remotely close to 232 picoseconds, but there are a number of time-transfer applications using special equipment where NTP might be useful, including our own gigabit network research program. The same principles arise when synchronizing mundane computers on the Internet. Not all hosts can or even need to achieve millisecond time transfer and sub-ppm stability; however, I did not want the spec or algorithms to be the limiting factors. I have explored in depth the design and capabilities of the local-clock reference oscillators found in typical computing equipment and concluded the time and frequency transfer claims made in NTP are justified. Further discussion on this point can be found in my paper in the January 1990 issue of ACM Computer Communication Review and in my paper to appear in IEEE Trans. Communications. PostScript versions of these papers can be FTPed from louie.udel.edu as the files pub/ntp/ccr.ps and pub/ntp/trans.ps, or I can mail you paper copies if you wish.
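For readers without the spec at hand, a toy sketch of the type-II discipline just described may help. The gains here are illustrative stand-ins, not the engineered NTP constants:

    /* Toy type-II (proportional-integral) clock discipline.  theta is
     * the measured offset in seconds, mu the seconds since the last
     * update.  The returned rate is applied as a gentle slew until
     * the next update.  Gains are illustrative; real stability depends
     * on engineering them against the update interval. */
    static double freq_comp;    /* learned frequency error, s/s */

    double pll_update(double theta, double mu)
    {
        const double kp = 1.0 / 16.0;      /* phase (proportional) gain */
        const double ki = 1.0 / 65536.0;   /* frequency (integral) gain */

        freq_comp += ki * theta;           /* second integrator: drives
                                              the residual offset from a
                                              constant frequency error
                                              to zero over time */
        return freq_comp + kp * theta / mu;  /* slew rate until next update */
    }

Dropping freq_comp yields a type-I loop: unconditionally stable, but a constant frequency error then leaves a proportional residual offset, which is exactly the limitation noted above.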
You call out the specifications of a typical uncompensated quartz oscillator as:

   Accuracy at 25 degrees C:  +/- 10 ppm
   0 to 50 degrees C:         +/- 5 ppm
   Aging:                     3 ppm/year first year, 2 ppm/year thereafter
   Supply voltage:            little impact, on the order of 10**-7 per % change

These specifications are in fact much better than those I find in typical computing equipment, where frequency inaccuracies up to 100 ppm, temperature sensitivities up to 1 ppm per deg C and aging rates up to 0.1 ppm per day (36 ppm per year) have been measured. By contrast, the $700 Isotemp 5-MHz OCXOs used here have a specified stability of +/- 5x10^-9 from 5 to 55 deg C and an aging rate of 1x10^-9 per day after 30 days. We keep them honest with a cesium oscillator calibrated by USNO. In spite of widespread mediocrity, and while only a few NTP servers are equipped with precision oscillators (two have cesium oscillators, three have OCXOs and one a TCXO), the vast majority of NTP-controlled oscillators can hold frequency surprisingly well. In these oscillators the dominant error term is neither noise nor short-term stability, but temperature sensitivity. Under typical indoor conditions, both in and out of machine rooms, I have often opened the PLL loop at a primary server and found it within a few milliseconds of reference time after coasting for some days without outside correction. Of course, not all oscillators conform to these anecdotal observations; however, a goal in the NTP design was to provide the highest possible performance with whatever oscillator is available. In fact, one reason for the adaptive-parameter design was to automatically optimize the loop bandwidth for the particular oscillator stability characteristics, with the baseline set on the basis of expected diurnal variations of a few ppm over the 24-hour period.

You quote the spec: "The effect is to provide a broad capture range exceeding four seconds per day, yet the capability to resolve oscillator drift well below a millisecond per day. These characteristics are appropriate for typical crystal controlled oscillators with or without temperature compensation or oven control." The intent is to state that the characteristics are appropriate for oscillators with and without temperature compensation (indeed, the loop adapts to each type) and that, with the appropriate oscillator, stabilities of a millisecond per day are achievable. I hope the quote was not misleading.

For clarification on a few other points you raise, note that the NTP PLL does not attempt to compensate for quartz aging, which results in a gradual change in frequency over time. This of course requires a type-III PLL, which is in fact used in some disciplined secondary frequency standards built for digital telephone network synchronization; however, I did not feel in this case that the additional complexity required would be justified. I did in fact experiment with a second-order, type-II PLL in order to further minimize the phase noise, but this raised problems due to the tight constraints on update intervals. The type-II loop is stable throughout the range that results in two-way reachability.

Your note suggests that an alternative to the perceived complexity of the NTP PLL is a manual observation of the frequency error measured over a day, which could presumably be done at installation and saved in a file for recall at system reboot. This is exactly what Dennis has done. It might be just as easy to equip the oscillator module with a trimmer capacitor and trim out the error when the module is built.
However, the intent in NTP was to do this automatically, with the startup value used only if available, and then only to reduce the initial convergence time. Following are specific responses to some of your technical comments. Dennis may have some more of his own.

Your quote from "Modern Control Engineering" by Ogata, p. 7: "From the point of view of stability, the open loop control system is easier to build since stability is not a major problem. On the other hand, stability is always a major problem in the closed loop system since it may tend to overcorrect errors which cause oscillations of constant or changing amplitude." You go on to say a type-II system is more apt to be unstable than a type-I. Since both DTS and NTP derive timestamps relative to the local clock, both are certainly closed-loop systems. As mentioned previously, type-I systems (DTS) are unconditionally stable; however, type-II systems can be stabilized through good engineering design, such as alleged in NTP. I can't answer the question of whether this is the best design appropriate for all conditions; however, the design has been validated over the sometimes ludicrously large envelope of conditions found in Internet LANs and WANs over the past decade.

Your comment on "UTC accuracy to the nanosecond" requires frequent trips to USNO, of course. The proper statement should be "time transfer to the subnanosecond, UTC transfer to the limits of the available timekeeping components and time provider." GPS can achieve precisions of a few nanoseconds only if the ephemeris dither is turned off, which after the recent announcement is not likely. Kinemetrics claims their GPS time provider (with the GPS receiver actually manufactured by Rockwell Collins) is accurate to 100 ns relative to USNO and 250 ns relative to UTC. As to the CCIR expectation of UTC dissemination to the microsecond, the whims of the US Congress were not respected. Current legislation requires LORAN dissemination to 500 ns, and the Coast Guard expects to improve that to the order of 50 ns. Judging from measurements made by my grad student and published USNO corrections, there does not seem to be any chance of achieving that. On the other hand, NTP was also intended for local time transfer, for which the 232-ps resolution would seem to be justifiable.

Your comments on NTP's lack of formal proofs are well taken. While these goals may have been neglected with respect to goals of performance, I think we all agree that minor changes can easily be made to NTP with the effect that claims similar to those made for DTS can be made for NTP. In fact, as an experiment I crafted Marzullo's algorithm into the NTP simulator I use for evaluation and am testing it as part of the algorithmic components. The result is no decrease in accuracy and, presumably, a correctness capability. Note that the NTP "inaccuracy" is calculated in the same way as DTS, but includes only a maximum bound on the frequency error per day.

You make an important point about estimating the state of the network at one time based on observations about its state at another. I worry about this, too, and have accumulated a rather large collection of measurement data between NTP servers in the Internet. Some conclusions on this issue can be found in the papers cited above. In particular, I have used NTP on many occasions as a management tool to detect changes in network routing and as an alert for congestion conditions.
The fact that it runs continuously and produces accuracies on the order of a few milliseconds on most primary and secondary servers, relative to extant path delays, has proved a highly useful diagnostic tool.

Your comment that one-way delay asymmetries can lead to estimation errors applies to both DTS and NTP, of course. However, most NTP servers run the protocol with at least three peers via diverse paths, and some of them use the algorithms described in the 1978 NBS monograph you reference to reduce the errors. The February 1990 spec revision describes how this is done and presents the statistical justification for it. In practice, asymmetries as large as the 100 ms you report are quite rare on the Internet, although those of 10-20 ms are common and some (US-European) paths are as high as 70 ms. Mixed satellite-terrestrial paths have in the past haunted us, but the only ghost left now seems to be the USAN network.

While I realize our recent message exchanges have required substantial time investments from each of us, I would like to again emphasize the value of an ongoing dialog within the research and engineering communities. I have strived to maintain an objective and productive tone in these exchanges and would like to encourage you to share ideas, experiences and even flames with us and to participate in experiment opportunities as they develop.

------------------------------------------------------------------------

Date: Tue, 3 Apr 90 19:42:38 EST
From: Dennis Ferguson
To: soha@nerva.enet.dec.com
Cc: comuzzi@took.dec.com, mills@udel.edu
Subject: Re: My response to Dennis Ferguson's NTP vs DTSS memo - retry

I must apologize for the tone of the last message. Chalk it up to a little bit of frustration concerning the course things seemed to be taking. Let me make it very clear that my interests are in good quality network timekeeping. I have some understanding of the issues, and I like implementing software which does this stuff well. I have no emotional attachment to either NTP's, or anyone else's, packet or timestamp formats, nor do I stand to gain much benefit from playing with this stuff. I will likely do a DTS implementation no matter what form it ends up being standardized in; I'm just not going to like it much if it isn't a good protocol. What I do like is good quality timekeeping.

I suspect NTP's encounter with DTS will make it a better, more usable protocol. At this point you can bet that NTP version 3 will include a correctly determined uncertainty interval (one way to compute such a bound is sketched below), and very likely won't go out without an autoconfiguration protocol and procedures for authentication key management, as well as an SNMP MIB. None of these conflict with the machinery that NTP already includes, which is very good at keeping your clock accurate. What bothers me is that I would not like to see an international standard timekeeping protocol which is substantially less accurate than NTP, just because at this stage there is just no reason for it. The idea of not providing frequency compensation for your system clock is hence a wee bit shocking, and I also see no reason for you to ignore the NTP clock selection/combination procedures as long as the offset they produce lies within the uncertainty interval. Incorporating the authentication procedures within the protocol, while unclean, also has its advantages.
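One way such a bound can be derived from a single NTP-style exchange is sketched here, with illustrative names: t1 and t4 are the client's transmit and receive times, t2 and t3 the server's receive and transmit times. This is a sketch of the classical result, not either protocol's exact bookkeeping:

    /* Given a correct server and nonnegative one-way delays, true
     * time provably lies within +/- delay/2 of the midpoint offset
     * estimate, plus the server's own inaccuracy; the bound then
     * grows at the assumed maximum drift rate rho until the next
     * update, exactly as a DTS inaccuracy does. */
    double offset_estimate(double t1, double t2, double t3, double t4)
    {
        return ((t2 - t1) + (t3 - t4)) / 2.0;   /* midpoint estimate */
    }

    double inaccuracy(double t1, double t2, double t3, double t4,
                      double server_inacc,  /* server's own bound, s  */
                      double rho,           /* max drift rate, s/s    */
                      double elapsed)       /* seconds since exchange */
    {
        double delay = (t4 - t1) - (t3 - t2);  /* round trip less hold */
        return delay / 2.0 + server_inacc + rho * elapsed;
    }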
In any event, I won't go on for long here since Dave has covered most of the issues, and I'd rather spend the time implementing whatever he produces in the way of "correct NTP", if only to prove that one can be both correct and accurate with one's timekeeping. Just a couple of additions to Dave's comments on frequency compensation (I didn't call it clock training, either) of the local clock.

The code implements a PLL; this is certainly covered in Time and Frequency Fundamentals and is commonplace in the timekeeping industry. Indeed, if you pry the cover off a good quality IRIG time code receiver (I have technical documentation for several made by Trak Systems), or even a good quality WWVB or GOES receiver, you will very likely find a microprocessor inside which implements pretty much the same procedure to synchronize the local oscillator (this is often called a "disciplined oscillator" in the advertising brochures). Further, NTP's local clock really is separate and distinct from the network time exchange protocol (whether NTP or DTS). It is actually part of the kernel in fuzzballs, and Dave has been encouraging Louie Mamakos to insert it into the 4.4 BSD kernel behind the adjtime() interface (something which I have some reservations about, but certainly not on the basis of it being NTP-specific; it just isn't). NTP's local clock is not a timekeeping protocol; it sits behind the timekeeping protocol and receives offset estimates from the latter. Your comment that "DTS and NTP are both timekeeping protocols" was way off the mark.

You are right that there are tradeoffs between the type-II control loop that NTP uses and DTS's type-I control. Note that an error in frequency in essence presents a ramp input, whose slope is the frequency error (or drift), to your control loop. The slope of the input sometimes changes (usually with temperature; I have a plot pegged to the wall beside my desk of the temperature calibration of the crystal in my workstation, measured with NTP; the slope is about -1.1 ppm/C), though changes to the frequency error are normally quite small compared to the longer-term average.

The NTP type-II loop does several things right. First, it tracks the ramp input (i.e., the frequency error) with zero steady-state error. It will also track changes in the frequency error (i.e., "drift" variations) fairly accurately if they occur slowly enough. DTS can't do this; it exhibits a steady-state error when tracking a ramp, proportional to the slope. Changes in the slope change the steady-state error. Second, because the time constant of the loop is fairly long, the NTP type-II loop tends to damp out statistical noise in the data you are trying to phase-lock to (i.e., the offsets produced by the time exchange protocol). DTS's type-I loop, however, allows this jitter right through to the system clock. Third, while unrelated to type I versus type II, the NTP local clock applies an adjustment to the system clock by making little tiny adjustments once every 4 seconds, rather than all at once. This spreading out of the adjustment avoids the high-frequency jitter that is otherwise caused. Note that for the archetypal DTS-synchronized clock with a 100 ppm frequency error, we've agreed that the system clock will be zooming around by 90 ms over the 15-minute update interval (100 ppm is 10^-4, and 10^-4 x 900 s = 90 ms), with an average (steady-state) error of 45 ms.
The NTP local clock, when faced with a 100 ppm error for which it doesn't know the correction (this can happen when you run the daemon on a machine for the first time; it is analogous to a large step change in the slope of the signal you are tracking), will exhibit a transient error which is initially about 40 ms in magnitude as well, but this will be stable, without the big jitter. What you lose for this is speed in correcting step changes in the clock condition. DTS can correct the system clock offset which can occur at startup in the first update. Similarly, DTS will almost immediately assume the steady-state operating condition as soon as it is started on a machine. Offsets due to lost clock interrupts are corrected within an update or two as well. The NTP local clock takes longer to correct these types of events, and/or requires a higher update rate while doing so.

Now the tradeoff. Local oscillators whose frequency is wrong, and whose frequency varies, are a fact of life. This is very nearly always the case. Likewise, statistical jitter in the offsets produced by the timekeeping protocol (whether NTP or DTS) is equally unavoidable, so damping of these fluctuations is highly desirable. Thus the NTP local clock deals with the common case (a system clock whose frequency is inaccurate and which varies somewhat, and somewhat noisy data from the network) much more accurately than DTS's type-I control. On the other hand, it corrects startup transients, and responds to step changes due to things like lost clock interrupts, more slowly than DTS does (it deals with these things, but at a more stately pace). Note that the latter are exceptions, however, since you start the protocol infrequently and you hope you don't lose clock interrupts at all. Thus the NTP local clock optimizes performance in the ordinary case at some expense to speed when handling exceptional conditions. I think this is a good engineering tradeoff.

I think frequency compensation of the system clock is a must for DTS, and I'm not going to be happy if it progresses towards standardhood without it. I'm also not convinced that you shouldn't be looking at the way NTP filters and selects samples, since this does measurably improve your time but certainly doesn't preclude the calculation of a correct uncertainty interval.

I think I may have said enough, since I obviously have yet to convince anyone. The more I consider it, though, the more I think the correctness-and-management versus performance tradeoff is just so much baloney. These issues are all separable; I see no reason why you can't have everything in one protocol. I think rather than trying to argue this position, however, it might be better to spend the time producing a correct, autoconfiguring, accurate NTP that I can give to people, so that no one at the standards committee will believe you if you try to foist this argument off on them.

By the way, it occurs to me that an NTP which computes a correct uncertainty interval will allow us to make head-to-head performance comparisons between DTS and NTP. If both protocols compute provably correct inaccuracies, but one consistently produces a smaller inaccuracy on the same machines via the same network paths, is not the latter a better, more useful protocol? I think so, and I have a distinct feeling both NTP's local clock and clock filters are going to make this quite interesting. We may be able to convince you DTS needs some of this stuff after all.

------------------------------------------------------------------------