A Comparison of the Network Time Protocol and Digital Time Service
David L. Mills, University of Delaware, 12 February 1990

Review by Joe Comuzzi, DEC
Further Commentary by Dave Mills, UDel
18 March 1990

Following is a review and commentary on the above document, which is available in the file pub/ntp/dts.txt on louie.udel.edu. This document is available in the file pub/ntp/dtsrev.txt on the same host. The original document is based on the DTS specification version T1.0.5 dated 18 December 1989, which I assume can be obtained from DEC.

At my suggestion Joe Comuzzi of DEC thoroughly and incisively reviewed my document comparing NTP and DTS. He found some agreement, some disagreement and some errors on my part. I much appreciate the time and care this effort required. In the same spirit, I have reviewed his comments and responded with comments of my own. As time permits I intend to incorporate appropriate revisions into the body of the original document and submit it for wider distribution. Meanwhile, I offer the following discourse for further comment and evaluation. Personally, I have found the exchange useful, stimulating and suggestive of further refinements to NTP.

The following discourse includes only those portions of the original document that are relevant to the reviewer's comments. These are indented three spaces. The reviewer's comments are flush with the left margin. These comments are included in their entirety and are unedited. My reply comments are preceded by a ">" symbol. References to the latest specification are to RFC-1119, with the exception of the mention of new appendices in the revised version of February 1990, which can be found in the PostScript file pub/ntp/ntp.ps on louie.udel.edu.

-------------------------------------------------------------------------

   The Digital Time Service (DTS) for the Digital Network Architecture
   (DECnet) is intended to synchronize time in computer networks ranging
   in size from local to wide-area.
You seem to be trying to clothe DTS in a proprietary cloth. We now refer to DECnet as DECnet/OSI since we've incorporated OSI protocols into the protocol stack. It is our intention to pursue DTS in the OSI standards forums.

> I have no intent to clothe DTS in anything other than explicitly stated
> on the cover and introduction to the spec document. There it says "DNA
> Phase 5 network." I will be glad to preach any other gospel or creed
> practiced by DEC's men of cloth if you will change the cover and
> introduction to the spec.

   As such it is intended to provide service comparable to the Network
   Time Protocol (NTP) for the Internet architecture.

While both are clearly addressing the same problem space, DTS and NTP have VERY different goals. I recently spoke to the president of a time-provider manufacturer and I liked his jargon; he distinguished between the time-of-day market and the frequency market. The time-of-day market wants to know what time it is; it is not interested in small errors and it doesn't want to pay a lot. The frequency market wants stable frequency sources, needs high stability and is willing to pay.

> I didn't know the time providers distinguished between the time-of-day
> market and frequency market. Certainly their customers don't know the
> difference. No timecode receivers known to me have the requisite
> stability to be considered primary frequency providers in any case;
> that's what rubidium and cesium standards are for. I do not understand
> the basis for your conclusion that accurate frequency costs more than
> accurate time. While the algorithms are somewhat more complicated and the
> host-clock implementation must be more rigidly specified, this does not
> necessarily cost more, especially if there is almost a decade of research
> in refining the methodology.

NTP is a solution for the frequency market. DTS is only interested in the time-of-day market.
The major cost for these solutions is not the initial capital investment, but the long-term management and operation cost. As such DTS has goals of auto-configurability and ease of management which are not present in NTP.

> If you are convinced that accurate, reliable time-of-day service can be
> achieved without consideration for frequency and believe that errors as
> much as several seconds per day in the absence of connectivity are
> acceptable, then I won't argue with DTS being a reasonable approach. I
> accept that NTP has goals primarily of stability, accuracy and
> reliability and secondarily of configurability and ease of management,
> since other Internet protocols would be expected to provide those
> functions (see below).

> (portion deleted)

   The goal of a distributed timekeeping service such as NTP and DTS is
   to synchronize the clocks in all participating servers and clients so
   that all are correct, indicate the same time relative to UTC, and
   maintain specified measures of stability, accuracy and reliability.

As stated above, DTS is addressing the time-of-day market; hence high frequency stability is not a goal of DTS.

> Do you mean that "specified measures of stability, accuracy and
> reliability" do not apply to DTS? Should I specifically point out that
> stability is a non-goal of DTS? A stability bound is in fact an
> architectural constant "maxDrift" in DTS, which sounds like a
> "specified measure" to me.

> (portion deleted)

   Servers, both primary and secondary, typically run NTP with several
   other servers at the same or lower stratum levels; however, a
   selection algorithm attempts to select the most accurate and reliable
   server or set of servers from which to actually synchronize the local
   clock. The selection algorithm, described in more detail later in this
   document, uses a maximum-likelihood clustering algorithm to determine
   the best from among a number of possible servers.
   The synchronization subnet itself is automatically constructed from
   among the available paths using the distributed Bellman-Ford routing
   algorithm [BER87], in which the distance metric is modified hop count.

Note that in DTS loops are not a problem: if a system sends out a time and ultimately gets back a derived time, due to the communication delays the derived time will always arrive back with a larger inaccuracy. The only exception to this is the possibility of a system with a time provider and a lousy clock. Then the derived time's inaccuracy could be smaller if the time was parked in a system with a good clock. But in this case the network clearly has information that the original system has lost.

> It would seem that the strategy to avoid subnet loops is similar in both
> NTP and DTS, although in NTP the metric is stratum (hop count) and in
> DTS it is the inaccuracy interval (is there a better word than
> "inaccuracy" with a more positive connotation?) Both NTP and DTS
> appear to operate in similar ways to cast out noisy timecode receivers,
> although it is not clear to me how the DTS manager determines from the
> protocol and the radio what the inaccuracy interval should be. Both NTP
> and DTS model the receiver similar to an ordinary peer, presumably with
> smallest inaccuracy interval or lowest stratum. In principle, both could
> estimate these and related information directly from the timecode
> samples.

> (portion deleted)

   The NTP specification includes no architected procedures for servers
   to obtain addresses of other servers other than by configuration
   files and public bulletin boards.

This is a serious shortcoming of NTP and definitely makes it harder to manage. It is unclear to me why you haven't fixed this, since it would not seem that difficult to store server names in a namespace.
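The hop-count discipline discussed above can be illustrated with a toy sketch (names and data structures are purely illustrative, not from either specification): each server adopts the neighbor advertising the lowest stratum and then operates one stratum below it, the same rule that lets a distributed Bellman-Ford computation converge without loops.

```python
# Toy sketch of stratum (hop-count) peer selection, as used to build
# the NTP synchronization subnet. Illustrative only.

def select_peer(peers):
    """Pick the peer advertising the lowest stratum; our own stratum
    becomes that peer's stratum + 1, which bounds path length and
    prevents timing loops the way hop counts prevent routing loops."""
    best = min(peers, key=lambda p: p["stratum"])
    return best, best["stratum"] + 1

peers = [
    {"name": "srv-a", "stratum": 2},
    {"name": "srv-b", "stratum": 1},   # closest to a primary source
    {"name": "srv-c", "stratum": 3},
]
best, my_stratum = select_peer(peers)
# best is srv-b, and this host then operates at stratum 2
```

The real NTP selection weighs synchronization distance as well as stratum; this sketch shows only the loop-avoidance metric.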
> There are three issues here: (1) how to discover a set of time-servers
> which are potentially useful peers, (2) how to intelligently select
> an appropriate subset, based on performance expectations and (3) how
> to translate names to addresses. Internet protocols are notoriously
> weak on (1) and (2); however, (3) is a non-issue with NTP, since all
> NTP daemons use the Internet DNS to resolve addresses from names and in
> principle could use the DNS to discover servers (WKS records). For (1)
> now, there is a master file on an obscure host, which is updated
> haphazardly at irregular intervals using completely unauthenticated
> data obtained from unreliable sources. Issue (2) is Real Hard when
> the number of potential peers runs in the thousands and considerations
> of network overhead, access policy and export control (drat DES) are
> involved.

> DTS uses LAN discovery protocols and automatic global server registration
> in a global database, which vastly simplifies (1) and (3); however, I
> submit that, as DTS gets bigger, (2) will become as hard in DTS as it
> has in NTP. For instance, survey evidence suggests there are over 2000
> hosts supporting NTP and potentially available as servers registered
> in the DNS. Using the DTS model that flushes the server list every 12
> hours and expects that every server and clerk maintains the entire
> list, one might expect a good deal of network clanking, unless the list
> were pruned and stratified as a cooperative management exercise.

   While servers passively respond to requests from other servers, they
   must be configured in order to actively probe other servers. Servers
   configured as active poll other servers continuously, while servers
   configured as passive poll only when polled by another server. There
   are no provisions in the present protocol to dynamically activate
   some servers should other servers fail.

This is harder to fix and interacts with the spanning tree.
Here at least I can see why you didn't make it easier to manage. These problems make NTP a system administrator's nightmare, but are consistent with the two different sets of goals. Consistent with DTS goals we've accepted some "clock hopping" in exchange for ease of management.

> I'm not sure what you mean by "nightmare." Most NTP administrators
> snarf a copy of one of the two Unix daemons, compile it locally,
> make an uneducated guess which existing server(s) in the master list to
> use based on advice included in the distribution, build a simple
> configuration file, turn the keys and walk away. In fact, DEC is
> presently distributing NTP with Ultrix and includes a five-page writeup
> on how to do this; which, although not an engineered solution, would
> not ordinarily be considered a nightmare.

> While you and I might consider NTP configuration crude, it is really no
> better or worse than bringing up a j-random router or DNS server. In DTS
> clients and servers wake up once in a while and solicit time in
> connectionless mode on LANs and connection mode on WANs, while in NTP
> peers solicit time continuously at controlled rates in connectionless
> mode. On the issue of "dynamically activate," it appears that DTS does
> just that with backup couriers in order to minimize WAN overhead.
> This is a good thing and should be done in NTP. Dynamic activation is
> on my list, but not above integration with the IP multicast service.

   In response to stated needs for security features, NTP includes an
   optional cryptographic authentication mechanism. NTP also includes an
   optional comprehensive remote monitoring mechanism found necessary
   for the detection and repair of various problems in server and
   network configuration and operation. It is anticipated that, when
   generic features capable of these functions have been developed and
   deployed in the Internet, the NTP authentication and monitoring
   mechanisms may be withdrawn.
> This might be called poor-boy network management; expedient and ugly,
> but necessary. An SNMP interface is in progress for one of the Unix
> daemons. Same goes for the authentication mechanism, which is a
> necessary feature used to partition the subnet for repair when
> a server comes unglued.

> (portion deleted)

   In DTS a synchronization subnet consists of a structured graph with
   nodes consisting of clerks, servers, couriers and time providers.
   With respect to the NTP nomenclature, a time provider is a primary
   server, a courier is a secondary server intended to import time from
   one or more distant primary servers for local redistribution and a
   server is intended to provide time for possibly many end nodes or
   clerks. Time providers, servers and couriers are evidently generic,
   in that all perform similar functions and have similar or identical
   internal structure.

Not only are they generic, they are dynamic. If a time-provider system loses its radio signal, it immediately reverts to a server, providing graceful degradation in the presence of failures.

> Your enthusiasm is contagious. NTP does exactly the same thing.

The DTS story is actually even better here: we provide a well-defined time provider interface. This can be used to implement a time provider without requiring modification of the protocol portions of the time service. (On Unix systems it uses Unix domain sockets.) This greatly eases adding a new time provider, and permits time provider vendors to supply it with their hardware. Note, NTP could (and probably should) do this also. We have already done it.

> The NTP spec includes a procedure for a time provider interface, although
> the entity interactions are only informally specified. However, the NTP
> interface is substantially the same as the peer interface, while in DTS
> the interface is different.
> Perhaps the most interesting difference is
> that the DTS provider interface expects a series of time values and
> uses the DTS procedures to refine the estimate, which is similar in
> intent to the NTP clock filter, but the NTP clock filter applies to
> all peers in addition to the provider.

> As specified in the introduction, the NTP spec is not intended as a
> formal one (in the best and worst Internet traditions). However, we
> have a little project at UDel to rewrite it in Estelle and throw test
> cases at it. The project has already found a small number of minor
> sleazes and obscurisms. You are to be congratulated on your formal
> approach using Modula2+. Have you subjected the protocol description
> to formal verification and testing? Can you make your Unix daemon
> available for testing? Would you agree to publish the spec document
> as an RFC?

   As in NTP, DTS clients and servers periodically request the time from
   other servers, although the subnet has only a limited ability to
   reconfigure in the event of failure.

I don't understand this statement. Reconfiguration within a LAN is about as complete as one could imagine. The random selection of global servers is robust against any non-partitioning WAN failures.

> My statement was misleading and should be clarified. Assuming the
> global directory service is robust, DTS certainly is robust against
> non-partitioning WAN failures; however, there are only three levels
> in the DTS subnet (global server, courier/server, client). In NTP there
> can be several levels or strata (commonly up to five or more). My comment
> was meant in the context of reforming the NTP subnet as a spanning
> tree rooted at the primary servers when something croaks. This of course
> requires engineered peer paths and prior knowledge of WAN connectivity,
> which is certainly not among the goals of DTS.

> (portion deleted)

   On local nets DTS servers multicast to each other in order to
   construct lists of servers available on the local wire.
   Clerks multicast requests for these lists, which are returned in
   monocast mode similar to ARP in the Internet. Couriers consult the
   network directory system to find global time providers. For local-net
   operation more than one server can be configured to operate as a
   courier, but only one will actually operate as a courier at a time.

This is false; I think you're failing to distinguish between couriers and backup couriers. There can be more than one courier per LAN; each will always synchronize with at least one member of the global set. Backup couriers use an election algorithm in the absence of a courier. Only one backup courier will be elected to function as a courier.

> Correction noted. Do you always expect to have multiple couriers (other
> than the single elected backup) in order to insure diversity and
> redundancy anyway? The local servers check each other for consistency
> and those set as couriers read at least one, but not necessarily more
> than one, global clock.

   There does not appear to be a multicast function in which a personal
   workstation could obtain time simply by listening on the local wire
   without first obtaining a list of local servers.

That is correct; it would violate the principle that a message exchange has to happen in order to correctly assign an inaccuracy.

> There appears to be a considerable Internet constituency which has
> noisily articulated the need for a multicast function when the number of
> clients on the wire climbs to the hundreds. Having responded to the
> articulation noise, I thought it might be a reasonable idea to include
> this capability (so far untested) on LANs with casual workstations,
> promiscuous servers and simple protocol stacks.

> (portion deleted)

   Perhaps the widest departure between the NTP and DTS philosophies is
   the basic underlying statistical model. NTP is based on
   maximum-likelihood principles and statistical mechanics, where errors
   are expressed in terms of expectations.
   DTS is based on provable assertions about the correctness of a set of
   mutually suspicious clocks, where errors are expressed as a set of
   computable bounds on maximum time and frequency offsets. This section
   explores these models and how they affect the quality of service.

> You chose not to respond to the statistical models presented. Does that
> mean you are in substantial agreement with the exposition?

> (portion deleted)

   Both NTP and DTS exist to provide timestamps to some specified
   accuracy and precision. NTP represents time as a 64-bit quantity in
   seconds and fractions, with 32 bits as integral seconds and 32 bits
   as the fraction of a second. This provides resolution to about 200
   picoseconds and rollover ambiguity of about 136 years. The origin of
   the timescale and its correspondence with UTC, atomic time and Julian
   days is documented in [MIL90c]. DTS represents time to a precision of
   100 nanoseconds, although there appears to be no specified maximum
   value.

The DTS time is a signed 64-bit count of 100-nanosecond units since Oct 15, 1582. It will not run out until after the year 30,000 AD, unlike NTP, which will run out in 2036. I, for one, intend to still be alive in 2036! There are two reasons the 100 ns was chosen:

1) We want to use these timestamps as a time representation, for filesystem timestamps, etc. We REALLY don't want to deal with the problem that our representation is inadequate in some reasonably future time. Also, since the 64 bits is signed, times back to 28,000 BC can be represented. This is potentially useful for astronomical data and, happily, includes all of recorded history. If we decreased the resolution, we would give up range. This choice seemed like a reasonable compromise.

2) Since we include the transmission delay in the inaccuracy, 100 ns represents only 30 meters. It's not meaningful to talk about synchronizing clocks below that level with our algorithm.
(I believe it's not meaningful to talk about synchronizing clocks below that level with NTP either.) The total timestamp is 128 bits; this includes a four-bit version number field which would permit these decisions to be revisited in the future.

> I won't argue with your choice of timestamp format. My choice was
> conditioned both by pragmatic issues of compatibility with other Internet
> timekeeping protocols, as well as a perceived need to operate at the
> highest accuracies and precisions capable of national laboratories. As
> for synchronizing clocks with NTP below the 100-ns level, a project to
> do exactly that is in progress here to compare LORAN-C and cesium time.
> Note that not all NTP subnets operate using general-purpose computing
> systems. My own zeal in pursuing the ultimate accuracy and precision
> is largely conditioned by our ongoing work in gigabit network routing
> and network synchronization.

> In any case, the DTS timestamp format including inaccuracy and version
> is a good idea. In principle, the inaccuracy is available in NTP in the
> form of the synchronization distance and dispersion, but this is not
> normally available at the Unix interface.

> (portion deleted)

   With respect to applications involving precision time data, such as
   national standards laboratories, resolutions less than the 100
   nanoseconds provided by DTS are required. Present timekeeping systems
   for space science and navigation can maintain time to better than 30
   nanoseconds, while range data over interplanetary distances can be
   determined to less than a nanosecond. While an ordinary application
   running on an ordinary computer could not reasonably be expected to
   expect or render precise timestamps anywhere near the 200-picosecond
   limit of an NTP timestamp, there are many applications where a
   precision timestamp could be rendered by some other means and
   propagated via a computer and network to some other place for
   processing.
   One such application could well be synchronizing navigation systems
   like LORAN-C, where the timestamps would be obtained directly from
   station timekeeping equipment.

There is an obvious inconsistency in your position here. If you're just using the NTP time format for synchronization, then talking about 136-year rollovers makes some sense; it could be hidden from the users by extending the protocol. If, however, as this paragraph implies, you intend the NTP time format as a general timestamp, then there will be extreme pain in the year 2036. (This is referred to in DEC as the "date75" problem!) To avoid this without unduly extending the timestamp, DTS has traded off being able to use its timestamp format for certain highly precise applications.

> I have vivid memories of shout-out meetings in the early eighties
> where we Interbums staked out positions on what you call the "date75"
> problem. It seems that, no matter what resolution and rollover parameters
> you select, somebody will complain the Big Bang or End of Time cannot
> be represented to femtoseconds. For that matter, while my personal clock
> may expire before 2036, even now I have great pain keeping track with
> conventional date notation of investments that mature after the century
> turns. In NTP I chose to explicitly and purposely leave out the 136-year
> disambiguation function and relegate that to a higher protocol that
> includes both this function and leap-second recording in network
> institutional memory. Since the Earth is winding down in an unpredictable
> way and papal bulls cannot endure forever and we haven't even got the
> Julian days and Gregorian centuries consonant yet, I concluded that
> life is too short and, like astronomers, we all should have used
> (modified) Julian day-fraction reckoning in the first place.

> (portion deleted)

   NTP specifically and intentionally has no provisions anywhere in the
   protocol to specify time zones or zone names.
   The service is designed to deliver UTC seconds and Julian days
   without respect to geographic position, political boundary or local
   custom. Conversion of NTP timestamp data to system format is expected
   to occur at the presentation layer; however, provisions are made to
   supply leap-second information to the presentation layer so that
   network time in the vicinity of leap seconds can be properly
   coordinated. DTS includes provision for time zones and presumably
   summer/winter adjustments in the form of a numerical time offset from
   UTC and arbitrary character-string label; however, it is not obvious
   how to distribute and activate this information in a coordinated
   manner.

The information is used only as a help in user displays. That is, an application can display BOTH the UTC time and the local time at which a timestamp was created. It only costs 12 bits to do this. No use is made of the timezone information by DTS or by systems.

> That clarifies the issue. Your intent is only to qualify the origin
> of the timestamp. Point noted.

   NTP and DTS differ somewhat in the treatment of leap seconds. In DTS
   the normal growth in error bounds in the absence of corrections will
   eventually cause the bounds to include the new timescale and adjust
   gradually as in normal operation. Recognizing that this can take a
   long time, DTS includes special provisions that expand the error
   bounds at such times that leap seconds are expected to occur, which
   can shorten the period for convergence significantly. However, until
   the correction is determined and during the convergence interval the
   accuracy of the local clock with respect to other network clocks may
   be considerably degraded. The accuracy and stability expectations of
   NTP preclude this approach. In NTP the incidence of leap seconds is
   assumed available in advance at all primary servers and distributed
   automatically throughout the remainder of the synchronization subnet
   as part of normal protocol operations.
   Thus, every server and client in the subnet is aware at the instant
   the leap second is to take effect, and steps the local clock
   simultaneously with all other servers in the subnet. Thus, the local
   clock accuracy and stability are preserved before, during and after
   the leap insertion.

Each server has to maintain and propagate this state before the leap insertion. This is, of course, subject to Byzantine failures. A failing server can insert a bad notification.

> Did I miss something? By "propagate this state" do you mean DTS will
> propagate advance notice of leap seconds? From what I can find rummaging
> over the text, it appears that entities are expected to add one second
> to their inaccuracy intervals at the end of June and December, which
> would certainly shorten the convergence period if a leap did in fact
> occur; however, there will be an unpredictable interval following that
> when the clocks are all scurrying to catch up and network time can
> be inconsistent up to a second. I worry about Byzantine failures, too.
> That's why all those NTP timestamp consistency tests and, ultimately,
> the NTP authentication scheme. It would appear that DTS is vulnerable
> to replay in the same way NTP is vulnerable without this scheme.

> (portion deleted)

   At first glance it may appear that NTP and DTS have quite different
   models to determine delay, offset and error budgets. Both involve the
   exchange of messages between two servers (or a client and a server).
   Both attempt to measure not only the clock offsets, but the roundtrip
   delay and, in addition, attempt to estimate the error. The diagrams
   below, in which time flows downward, illustrate a typical message
   exchange in each protocol between servers A and B.
         A          B                  A          B
         |          |                  |          |
      t1 |--------->| t2            t1 |--------->|---  t4
         |          |                  |          |  |
         |          |                  |          |  w
         |          |                  |          |  |
      t4 |<---------| t3            t8 |<---------|---
         |          |                  |          |

             NTP                           DTS

   In NTP the roundtrip delay d and clock offset c of server B relative
   to A is

      d = (t4-t1) - (t3-t2)
      c = ((t2-t1) + (t3-t4))/2.

   This method amounts to a continuously sampled, returnable-time
   system, which is used in some digital telephone networks [LIN80].

The derivation of the expression for 'c' above assumes the two transit delays for this exchange are symmetric. If there are systematically asymmetric transmission delays, then the NTP algorithm will shift the two clocks so that they appear to be synchronized, when in fact they are systematically off by some number of milliseconds. The NTP minimum filter attempts to minimize this effect, assuming that the shortest round-trip exchange would have to be symmetric or nearly so. Unfortunately quite large systematic asymmetric delays can occur for a variety of reasons (source-routed networks, broken routing tables, etc.) and these would apply to all transactions including the shortest. This problem exists in DTS also, but in DTS both of the systems will have an inaccuracy which encompasses the correct time. That is, DTS will not claim to have synchronized clocks to a level which it has not, even in the presence of asymmetric delays. NTP can and has.

> Your observation on asymmetric paths leading to undetectable systematic
> errors with both NTP and DTS is correct and is routinely observed to
> varying degrees on the Internet. In fact, leaving out adjustments
> necessary for frequency offset and precision (in both NTP and DTS) the
> above formulas can be rewritten as presented in the DTS spec. We have
> a project here designed to collect offset data from many or even all
> subnet servers at non-intrusive rates in order to detect and correct
> for asymmetric paths using correlation techniques.
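As a concrete illustration of the NTP delay/offset formulas discussed above (a sketch only; the specification's full procedure also compensates for frequency offset and precision):

```python
def ntp_delay_offset(t1, t2, t3, t4):
    """Roundtrip delay d and clock offset c of server B relative to A,
    from the four NTP timestamps: t1 = A sends, t2 = B receives,
    t3 = B replies, t4 = A receives. Assumes symmetric path delays."""
    d = (t4 - t1) - (t3 - t2)
    c = ((t2 - t1) + (t3 - t4)) / 2
    return d, c

# Example: B's clock is 5 ms ahead of A's, one-way delay 10 ms each
# direction, server turnaround 2 ms (all times in milliseconds):
d, c = ntp_delay_offset(t1=0, t2=15, t3=17, t4=22)
# d = 20 (total network delay), c = 5 (B is 5 ms ahead of A)
```

If the two one-way delays were instead, say, 5 ms and 15 ms, the same arithmetic would yield an offset in error by 5 ms, which is exactly the undetectable asymmetric-path error both correspondents acknowledge.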
> I'm not sure what you mean by "NTP can and has" claimed to "have
> synchronized clocks to a level which it has not, even in the presence
> of asymmetric delays." NTP does not claim to synchronize to any level,
> only to minimize the level of probabilistic uncertainty and estimate
> the error incurred. In any case, what NTP calls the synchronizing
> delay represents in fact the error bound relative to the synchronizing
> path to the primary source.

> These are probabilistic data and must be interpreted with respect to the
> probability model which applies to real Internet paths. It may be that,
> with an appropriate queueing model and assumed distribution functions,
> a quantitative error probability function could be derived. Having
> travelled those roads before myself, I conclude my pragmatic approach
> to error estimation is probably as good as any. See [ALL74] for an
> alternative approach.

> (portion deleted)

   Both NTP and DTS have to do a little dance in order to account for
   timing errors due to the precisions of the local clocks and the
   frequency offsets (usually minor) over the transaction interval
   itself. A purist might argue that the expressions given above for
   delay and offset are not strictly accurate unless the probability
   density functions for the path delays are known, properly convolved
   and expectations computed, but this applies to both NTP and DTS. The
   point should be made, however, that correct functioning of DTS
   requires reliable bounds on measured roundtrip delay, as this enters
   into the error budget used to construct intervals over which a clock
   can be considered correct.

However, this is not at all hard to compute. Simply increase the inaccuracy by the potential drift of the local clock during the transaction. The architecture specifies this.

> Not hard to do in NTP either, as the architecture specifies.
> The difference is that in NTP this is represented as a time-insensitive
> bound, since the architecture expects the local-clock algorithm to
> compensate for frequency errors. The system expectation is that the
> (corrected) local clock does not wander more than an architectural
> constant of 30 ms per day. Even in NTP it might be a good idea
> to ratchet up the imputed skew when all sources are lost and the
> bandwidth of the tracking loop is relatively large. This will be
> considered in future.

> (portion deleted)

   NTP maintains for each server both the total estimated roundtrip
   delay to the root of the synchronization subnet (synchronizing
   distance), as well as the sum of the total dispersion to the root of
   the synchronization subnet (synchronizing dispersion).

This synchronizing distance has a rather loose definition. I believe the current NTP RFC suggests using ten times the mean expected error for the synchronizing distance. If this parameter is important to the NTP algorithm I would expect some stronger specification. Also, where does the value ten come from? I know it's experimentally derived and seems to work...

> I must have confused you. Both the distances and dispersions are formally
> defined in the spec. The factor of ten applies only in cases where the
> delay and/or dispersion cannot be measured, such as with some timecode
> receivers. Elsewhere throughout the subnet these quantities are
> calculated. You will observe that the dispersion quantity is rather
> artfully concocted (for efficiency reasons) and not directly convertible
> to the usual second-moment statistics. Well, all I can say is that other
> practitioners of these black arts mumble similar voodoo, but the
> performance as error estimator is still pretty good. Now, I should make
> clear that a goal of NTP is to maintain overall accuracy relative to
> the synchronization distance (roundtrip delay) to the root of the subnet
> on the order of one-tenth that distance.
That is an arbitrary goal, but > believed achievable on the basis of past experience. > There is an interesting feature which becomes evident reading the DTS > and NTP specification documents. The DTS and NTP procedures for reading > server times, computing bounds and selecting sources are roughly the > same complexity, although NTP fiddles with both delay and dispersion. > In addition, the procedures DTS uses to adjust the local clock, compute > the correction interval and determine the next update time have roughly > the same complexity as the NTP local-clock procedure. These quantities are included in the message exchanges and form the basis of the likelihood calculations. Since they always increase from the root, they can be used to calculate accuracy and reliability estimates, as well as to manage the subnet topology to reduce errors and resist destructive timing loops. While you state the synchronizing distance and synchronizing dispersion can be used to calculate accuracy, I have never seen a derivation of how this could be done. This is one of the recurring points, the lack of formal proofs. > Formal proofs are hard to come by, unless you make drastic assumptions > on the statistical models and distributions operative in the Internet. > Certainly, by the same sort of analysis presented in the DTS spec, the > notion of correct UTC time (for a truechimer) belonging to the interval > defined as the offset estimate +-1/2 the delay estimate is valid, as > long as the frequency estimate is within the stated tolerance. However, > DTS and NTP differ substantially in the philosophy of the selection > algorithm, as explained in the text. Since your comments did not speak > directly to this issue and I suspect you remain unconvinced, a full > examination should await another time and place. > For an in-depth analysis of probabilistic models appropriate for > "well-behaved" timekeeping systems, see Appendix F of the latest spec > revision mentioned previously. 
You still might not like the results of > the analysis, since statistical models seldom give deterministic > results. I think you might not argue with a conclusion that accuracy > degrades with increasing synchronization distance and dispersion, but > might argue over the function that maps these numbers into acceptable > error bounds. For justification, see [MIL90a]. In NTP the selection algorithm determines one or a number of synchronization candidates based on empirical rules and maximum-likelihood techniques. A combining algorithm determines the local-clock adjustment using a weighted-average procedure in which the weights are determined by offset sample dispersion. (portion deleted) The next step is designed to detect falsetickers or other conditions which might result in gross errors. The pruned and truncated candidate list is re-sorted in order first by stratum and then by total synchronizing distance to the root; that is, in order of decreasing likelihood. A similar procedure is also used in Marzullo's MM algorithm [MAR85]. Next, each entry is inspected in turn and a weighted error estimate computed relative to the remaining entries on the list. The entry with maximum estimated error is discarded and the process repeats. The procedure terminates when the estimated error of each entry remaining on the list is less than a quantity depending on the intrinsic precisions of the local clocks involved. A point which is not discussed here is that when NTP chooses to prune an entry, it cannot determine if this entry's problem is that it comes from a bad clock (falseticker in your jargon), or experienced unusually large and asymmetric network delays. The latter case is something to be expected in normal operation; the former represents a problem which should be fixed. DTS uses the interval information to identify such bad clocks, and reports them, since if a clock's interval doesn't intersect the majority it is clearly faulty.
This is, of course, a MAJOR issue in distributed system management. > NTP can determine whether a peer or radio has not responded for a "long" > time or whether the problem is excessive dispersion. NTP implementations > do keep track of both and report when a peer or radio becomes selected > or deselected, reachable or unreachable and so forth. After watching peers > and radios of various manufacture continuously for several years and > experiencing what could be considered most bizarre behavior on occasion, > I have concluded there is no way to reliably distinguish a falseticker > from simple excessive delay or propagation variance on other than > a probabilistic basis. I claim this even after admitting the fuzzball > timecode receiver drivers have an incredible array of consistency > checking and monitoring machinery which can and often does detect a > misbehaving peer or radio. I also conclude that radio design can > be vastly improved by providing detailed signal-quality information in > the timecode itself. At one time the fuzzballs carefully and > exasperatingly logged and reported every little thing, like when a > peer or radio became unreachable or experienced excessive dispersion, > etc., but now these events are logged at the server and available only > if the remote monitoring program requests them. The fundamental assumption upon which the DTS is founded is Marzullo's proof that a set of M clocks synchronized by the above algorithm, where no more than j clocks are faulty, delivers an interval including UTC. The algorithm is simple, both to express and to implement, and involves only one sorting step instead of two as in NTP.
However, consider the following scenario with M = 3, j = 1 and containing three intervals A, B and C:

A      +--------------------------+
B      +----+
C                           +----+
Result +-----================-----+

Using the algorithm described in the DTS functional specification, both the lower and upper endpoints of interval A are in M-j = 2 intervals, thus the resulting interval is coincident with A. However, there remains the interval marked "=" which contains points not contained in at least two other intervals. The DTS document mentions this interesting fact, but makes a quite reasonable choice to avoid multiple intervals in favor of a single one, even if that does in principle violate the correctness assumptions. Come on, this in no way violates the correctness assumption. The proofs tell us that the correct time is somewhere in the two dashed sub-intervals. By making the statement that the time is somewhere in the larger interval, a server is making a WEAKER assertion. Marzullo's proof would apply and the algorithm would work (sub-optimally) if servers arbitrarily lengthened the intervals they computed. > Zounds, you have cut me to the quick. My conclusion was based on my > reading of the text in Section 3.3 of the DTS spec and the stated > algorithm, which seemed at first reading to me at variance with > Marzullo's principles presented in the CACM paper. In your algorithm > you arrange the endpoints in a list in order of indicated times, with > lower bounds preceding upper bounds of the same value. For M-j = 2 > and the above figure, the algorithm will start at the lower limits of > A and B and work upward, then start at the upper limits of A and C and > work down. The first step will conclude the lower limit as the lower > limit of intervals A and B and the upper limit as the upper limit of > intervals A and C. Your correctness assumption uses "the smallest > single interval containing all points in at least M-f of the intervals," > which is exactly what your algorithm computes.
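The endpoint sweep just described can be sketched as follows; this is an illustrative reconstruction of the interval-intersection idea, not the DTS code. With intervals shaped like the figure (B sharing A's lower endpoint and C its upper), the result is coincident with A:

```python
def dts_intersection(intervals, f):
    """Smallest single interval containing every point that lies in at
    least M - f of the given (lo, hi) intervals, per the endpoint-sweep
    description above.  Sketch only, not the DTS implementation."""
    m = len(intervals)
    # Endpoint list; lower bounds (-1) sort before upper bounds (+1)
    # of the same value, as the algorithm requires.
    events = sorted([(lo, -1) for lo, hi in intervals] +
                    [(hi, +1) for lo, hi in intervals])
    lower = upper = None
    count = 0
    for point, kind in events:            # sweep upward for the lower edge
        if kind == -1:
            count += 1
            if count >= m - f and lower is None:
                lower = point
        else:
            count -= 1
    count = 0
    for point, kind in reversed(events):  # sweep downward for the upper edge
        if kind == +1:
            count += 1
            if count >= m - f and upper is None:
                upper = point
        else:
            count -= 1
    return lower, upper
```

For example, `dts_intersection([(0.0, 26.0), (0.0, 5.0), (21.0, 26.0)], 1)` returns `(0.0, 26.0)`, the interval A of the figure, even though the middle of that span lies in only one of the three intervals.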
I can restate that by > saying you require at least one clock interval to include UTC, not that > each of the M-j = 2 clocks agrees to the same interval. As I recall, > Marzullo's paper did not consider this case, but it is a natural > extension. I conclude my claim is unfounded and it will disappear in the > rewrite. > (portion deleted) In point of fact, the local clock model described in the NTP specification is listed as optional in the same spirit as the model described in the DTS functional description. As such, the local clock can in principle be considered implementation specific and not part of the formal specification. This is a rather odd statement. What I read is that the local clock model is not explicitly required by the NTP documents, but it is, in fact, required in functioning implementations. > The intent in the original NTP spec was to define the protocol itself, > saving the filtering, selection, combining and local-clock algorithms > for later specification exercises. As a pragmatic matter, nobody would > implement NTP unless there was some guidance for these algorithms. As > the architecture and protocol were refined, it became clear that a > well-performing system of clocks could not be achieved unless > certain aspects of these algorithms were standardized, namely the > parameters of the local-clock algorithm, which is at the heart of > the stability issue. You correctly observe that the NTP spec is > confusing in this area. However, as demonstrated above, frequency compensation requires the local clock adjustment to be carefully specified and implemented. The NTP mechanism has been carefully analyzed, simulated, implemented and deployed in the Internet, but DTS has not. I have never read a clear specification of the required quality of the input time to NTP. However, the following argument shows that in a LAN of typical machines, DTS can indeed provide time to NTP. The clock resolution of most machines is between 1 and 16.7 milliseconds.
Thus, any single measurement made by NTP MUST experience this clock jitter. NTP can achieve better overall results only by averaging many such measurements. We have measured the 'jitter' of DTS times in LANs; it is less than 10 milliseconds, so if DTS supplies time to NTP in a typical LAN, NTP will receive time similar in quality to the time it gets from other NTP servers. In the WAN case the jitter may be a problem; I assume that interoperation in the presence of WAN links may require clock training. If you could provide the derivation of accuracy from synchronization distance and synchronization dispersion that you allude to in section 4.2, this could form the basis of reliable interoperation with NTP supplying time to DTS. Alas, I suspect such a derivation is unachievable. However, for installations which are not concerned with the DTS guarantee, the time provider interface could be used to import NTP time into DTS (just like any time provider, though there would have to be a user-supplied inaccuracy, based on local experience with NTP). We intend to include a sample time provider program to permit this. > As I said previously, and subject to the assumptions made there, the > NTP synchronizing distance is computed similarly to the DTS inaccuracy > interval. However, a derivation of estimated error interval from measured > distance and dispersion is not achievable on other than a statistical > basis, which wouldn't do you much good. However, there is a basic > flaw to your argument in achieving interoperability with NTP. The > NTP architecture involves a probabilistic system of mutually coupled > oscillators controlled by what is called in traditional control theory > a type-II phase-lock loop (PLL). A type-II loop is necessary to estimate > frequency, as well as phase.
If you accept the requirement that the > subnet of distributed oscillators must operate plesiochronously > (phase-locked to possibly many reference oscillators themselves slaved, > but not phase-locked to UTC), then you are stuck with type-II loops. > The fundamental problem with type-II loops is that they can become > unstable and sail off into the wild blue yonder if the loop time > constants are not maintained within specified tolerances. There is > much machinery in the NTP local-clock model that addresses these > issues in order to maintain stability throughout the subnet. It has > been the experience that stability can be reliably maintained over a wide > range of network delays, outages, etc.; however, the cost is a tighter > specification on the dynamic characteristics of the local-clock > algorithms. See Appendix G of the cited NTP spec revision for a > mathematical analysis of the NTP PLL. Note that RFC-1119 contains > minor errors in some of the implementation formulas. > It would in fact be possible to "take time" from a DTS server and splice > it into NTP, in spite of the probably large phase noise; however, it > would probably not be possible to integrate a DTS subnet into an NTP > system where DTS was used for time transfer between one NTP subnet and > another. > (portion deleted) It is an uncontested fact that computer systems can be badly disrupted should apparent time appear to warp (jump) backwards, rather than always tick forward as our universe requires. Both NTP and DTS take explicit precautions to avoid the local clock running backwards or large warps when running forwards. However, both NTP and DTS models recognize that there are some extreme conditions in which it is better to warp backwards or forwards, rather than allow the adjustment procedure to continue for an outrageously long time. The local clock is warped if the correction exceeds an implementation constant, +-128 milliseconds for NTP and ten minutes for DTS.
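The step-versus-slew policy described in the quoted passage can be sketched as follows; the 128-millisecond constant is NTP's, the ten-minute constant is DTS's, and the function itself is purely illustrative:

```python
NTP_STEP_THRESHOLD = 0.128   # seconds
DTS_STEP_THRESHOLD = 600.0   # seconds (ten minutes)

def correction_policy(correction, threshold=NTP_STEP_THRESHOLD):
    """Decide whether a computed correction is applied as a rare step
    (warp) or amortized gradually (slew), so that time never appears
    to run backwards in normal operation."""
    if abs(correction) > threshold:
        return "step"   # accepted only in extreme cases (reboot, etc.)
    return "slew"       # adjust the clock rate until the error is gone
```

A 30-second error, for instance, would be stepped under the NTP constant but slewed under the DTS one, reflecting the different accuracy models and assumed costs of jumping the clock.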
The large difference between the NTP and DTS values is attributed to the accuracy models assumed. I believe the difference also comes from different assumptions of the risks (and probabilistic costs) involved in jumping the clock. We assume it is something you want to do rarely. > The NTP experience is that, with a +-128-ms window and the Internet > peers I watch, I have not observed a jump any time over the last couple > of years, except upon reboot or upon insertion of the latest leap > second, when a couple of silly implementation bugs were found. Some > users have found it necessary to upsize the window on combined > satellite/landline paths and on paths frequently experiencing severe > network congestion. In fact, we have used up to +-512 ms on some paths > to Europe and would be glad to use larger ones should that become > necessary. I think this is a non-issue with respect to comparing > the NTP and DTS models. For most servers and transmission paths in the Internet an offset spike (following filtering, selection and combining operations) over +-128 milliseconds following filtering, selection and combining operations is so rare as to be almost negligible. The duplicated text makes me think there is something wrong here, though frankly I don't understand what this paragraph is trying to say. > Probably awkwardly stated, what I'm trying to say is that the combining > and local-clock algorithms have the effect of reducing apparent errors > following the clock filter by a substantial amount over the "few > tens of milliseconds" assumed by conventional wisdom. See [MIL90a]. > (portion deleted) The service objectives of both NTP and DTS are substantially the same: to deliver correct, accurate, stable and reliable time throughout the synchronization subnet. However, as demonstrated in this document, these objectives are not all simultaneously achievable.
For instance, in a system of real clocks some may be correct according to an established and trusted criterion (truechimers) and some may not (falsetickers). In the models used by NTP and DTS, the distinction between these two groups is made on the basis of different clustering techniques, neither of which is statistically infallible. A succinct way of putting it might be to say that NTP attempts to deliver the most accurate, stable and reliable time according to statistical principles, while DTS attempts to deliver validated time according to correctness principles, but possibly at the expense of accuracy and stability. I would claim you're understating DTS's goals of autoconfigurability and manageability. > I would be glad to elevate the consciousness of this issue in the > rewrite. In both the NTP and DTS models the problem is to determine which subset of possibly many clocks represents the truechimers and which do not. An interesting observation about both NTP and DTS is that neither attempts to assess the relative importance of misses (mislabelling a truechimer as a falseticker) relative to false alarms (mislabelling a falseticker as a truechimer). In signal detection analysis this is established by the likelihood ratio, with high ratios favoring misses over false alarms. In retrospect, it could be said that NTP assumes a somewhat lower likelihood ratio than does DTS. I'm not sure I understand your jargon here. The important trade-off for DTS is to notify managers of broken clocks (calling a falseticker a falseticker) so that they can be fixed. Declaring a good clock bad (labeling a truechimer a falseticker) could only occur in DTS as an implementation error or as a massive multi-server failure. In either case a human will have to get involved. > Likelihood ratio is a tool of mathematics and estimation theory and > is frequently used in statistical signal transmission and detection.
> The likelihood of an event can be computed from the probability > model and assumptions about the underlying events of that model. > For example, there are four possible outcomes of a probabilistic > hypothesis that purports to reveal the results of an experiment: > (1) you said it hit and it really hit, (2) you said it missed and it > really missed, (3) you said it hit, but it really missed and (4) you said > it missed, but it really hit. Now, a complete probabilistic analysis > would require you to place weights on each of these possible outcomes, > from which you can determine the overall success of your hypothesis > construction technique. This is where the likelihood ratio comes in. It might be concluded from the discourse in this document that, if the service objective is the highest accuracy and precision, then the protocol of choice is NTP; however, if the objective is correctness, then the protocol of choice is DTS. However, the discussion in Section 4.2 casts some doubt either on this claim, the DTS functional specification or this investigator's interpretation of it. I believe you are doing your position a disservice by raising this red herring. No one has found your argument that DTS violates the assumptions of Marzullo's thesis convincing. Lamport commented that it indicates a serious misunderstanding of Marzullo's proof. > The last sentence should be struck and tell Leslie I said "hi." It is certainly true that DTS is "simple" and NTP is "complex," but these are relative terms and the complexity of NTP did not result from accident. That even the complexity of NTP is surmountable is demonstrated by the fact that over 2000 NTP-synchronized servers chime each other in the Internet now. The ever-decreasing cost of time providers argues heavily for a simple solution, even though it may require more time providers.
It simply isn't worth a lot of software complexity (and maintenance cost, and management cost) to avoid spending a few dollars to buy more providers. Further, the philosophy of 'correctness' leads to certifiable implementations by independent vendors. > I continue to believe it is not constructive to "certify correctness" in > probabilistic systems, only to exchange acceptable tolerance bounds for > acceptable error bounds. If by "time providers" you imply each is > associated with a radio clock, I do not think it likely that the > cost of a radio clock will plummet to the point that every LAN can > afford one and, even if it did, you cannot trust a single radio. You > have to have more than one of them and, preferably, no common point > of failure between them. > (portion deleted) The widespread deployment of NTP in the Internet seems to confirm that distributed Internet applications can expect that reliable, synchronized time can be maintained to within about two orders of magnitude less than the overall roundtrip delay to the root of the synchronization subnet. For most places in the Internet today that means overall network time can be confidently maintained to a few tens of milliseconds [MIL90a]. While the behavior of large-scale deployment of DTS in internet environments is unknown, it is unlikely that it can provide comparable performance in its present form. With respect to the future refinement of DTS, should this be considered, it is inevitable that the same performance obstacles and implementation choices found by NTP will be found by DTS as well. I disagree with this final paragraph. I think that NTP and DTS both attain their very different goals. Our difference of opinion is in how important the different goals are. I accept that DTS will not keep clocks quite as tightly synchronized as NTP. It will, however, be a product that a vendor can confidently ship to customers who are expected to install, configure and manage it themselves.
> We sure do have vastly different goals. Mine is a scientific one. I am > keenly interested in the technology of synchronizing time and frequency > to the highest degree of performance possible in the present state of > the art. I have found it useful in my own research to promote and > sustain an agenda to systematically refine NTP as an architecture, > protocol and set of implementations and promote its establishment as > an Internet Standard protocol. I also find it useful to promote, help > run and mount experiments with a largish number of Internet hosts which > now find NTP useful. I do not have a commercial agenda, nor do I have a > particular interest in the standards process other than to hope whatever > lessons learned in almost a decade of Internet timekeeping are documented > and made available to the R&D community. You may have seen my message to > the OSF in which I said the same thing and my hope that you guys, who > well might own the standard of choice, thoughtfully consider the points > I raise and think about how those features you think valuable in the > long run might be anticipated now and perhaps added at some future time. (remainder deleted) Dave