The following graphs show the fluctuations in the rates of the system clock and of the real time clocks on a variety of computers on the theory network. Until June all were synchronized against the same system, ntp.ubc.ca, a stratum 2 ntp server on campus (the time delay to that machine from any of these computers is on the order of 100s of microseconds). As the top graph shows, that server had a 3-4 msec sawtooth drift against GPS time. Thereafter, string was synchronized against tick.usask.ca, a stratum 1 server synchronized against GPS. In Sept 2007, string was put onto ntp and synchronized against a stratum 0 GPS clock (a Garmin 18LVC GPS receiver with a PPS output), against which it maintains a roughly 2-3 microsecond offset. All of the other clocks are synchronized against it by chrony. It is less than a msec away, via switches, from all of the other clocks.
The following graphs plot the rate of the system clock vs the ntp server (red line and left hand scale) and the rate of the RTC (the real time clock, ie the CMOS clock) vs the system clock (dotted lines and right hand scale) against the time in days after 00:00 on the date shown. The rates are in units of microseconds per second. They are determined by comparing the reading on the system clock with the ntp determined times on the NTP server, which is used to adjust the rate of the system clock, and by comparing the RTC with the system clock. Note that the strong correlation between the rate fluctuations suggests that the system clock is the primary source of noise, and that in general the RTC has better stability than does the system clock.
In the graphs for the week ending Feb 11, the huge instability in the case
of one of the machines, info, and of the other machines after they were
restarted on Feb 9, is unexplained. There seems to be an instability in the
operation of chrony.
The restoration of a semblance of order after the 10th was done by
decreasing the maxupdateskew to 1/5 (from unlimited).
Dilaton was the most accurate clock in its rate fluctuations before that
restart, but not afterwards.
Well, I have finally tracked down the problem. That stratum 2 server
ntp.ubc.ca stinks. I got a gps device with a PPS output, which I hooked up
to a couple of the machines. The most interesting is string, which had some
of the most unstable behaviour with chrony and ntp.ubc.ca. In the following
graph, I have plotted the response of string to the gps clock (with chrony
switched off), to ntp.ubc.ca and to tick.usask.ca, a stratum 1 server.
The huge regular sawtooth waves come from ntp.ubc.ca. Not only is that
system on average about 3ms fast, its offset varies regularly.
tick.usask.ca is very much better behaved: considering that it is almost
10 msec away (peer delay), its time differs from the gps time by only
a few tens of microseconds. (The "line" across the top is the gps
time, with a width, a jitter, of about 3 microseconds. The jagged line
starting at 24 hr is tick.usask.ca, while the huge oscillation is
ntp.ubc.ca, a supposed stratum 2 source.) It may be that because ntp.ubc.ca
is running SunOS, the kernel cannot regulate the system clock properly,
leading to this behaviour.
(Note that in each case exactly the same overall drift has been removed
from the data, ie the drift was determined from the GPS clock and then the
same drift was removed from each of the other graphs.)
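The drift removal described in this note can be sketched roughly as follows. This is only an illustration, not the script used to make the graphs: the function names and the sample numbers are made up. A straight line is fitted to the GPS offsets, and the slope of that one line is then subtracted from every series.

```python
def fit_slope(times, offsets):
    """Least-squares slope of offsets against time (seconds per second)."""
    n = len(times)
    t_mean = sum(times) / n
    o_mean = sum(offsets) / n
    num = sum((t - t_mean) * (o - o_mean) for t, o in zip(times, offsets))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


def remove_drift(times, offsets, slope):
    """Subtract a linear drift, counted from the first sample time."""
    t0 = times[0]
    return [o - slope * (t - t0) for t, o in zip(times, offsets)]


# Hypothetical data: both series share a 20 usec/sec drift.
times = [0.0, 100.0, 200.0, 300.0]
gps = [0.0, 2.0e-3, 4.0e-3, 6.0e-3]
ntp = [3.0e-3, 5.5e-3, 6.5e-3, 9.0e-3]  # 3 ms offset plus a wobble

drift = fit_slope(times, gps)              # determined from the GPS data only
gps_flat = remove_drift(times, gps, drift)
ntp_flat = remove_drift(times, ntp, drift)  # same drift removed, not refitted
```

Because the slope is taken from the GPS series alone, whatever remains in the other series is their behaviour relative to GPS rather than relative to their own best-fit line.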
What is interesting is that while the gps spikes are all late ( by a few
microseconds) both the ntp sources are early. This seems to imply that the
outbound ntp packets take slightly longer than the inbound packets.
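To see why unequal path delays show up as an offset, here is a small sketch using the standard NTP on-wire formulas for offset and delay. The simulation helper and the delay numbers are hypothetical and chosen only for illustration. A difference between the two one-way delays biases the computed offset by half that difference, and the client has no way of detecting it from the four timestamps alone.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard NTP estimates from the four timestamps of one exchange.

    t1: client send, t2: server receive, t3: server send, t4: client receive.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate_exchange(true_offset, d_out, d_back, t1=0.0, server_hold=0.0):
    """Hypothetical helper: build timestamps for given one-way delays.

    true_offset is server clock minus client clock, in seconds.
    """
    t2 = t1 + d_out + true_offset        # server clock at arrival
    t3 = t2 + server_hold                # server clock at reply
    t4 = (t3 - true_offset) + d_back     # client clock at arrival
    return t1, t2, t3, t4


# Clocks agree exactly, but the outbound path is 20 usec slower.
stamps = simulate_exchange(true_offset=0.0, d_out=5.01e-3, d_back=4.99e-3)
offset, delay = ntp_offset_and_delay(*stamps)
# offset is (d_out - d_back) / 2, even though the clocks agree
```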
On Apr 14 all of the machines except dilaton and string were changed to get
their primary time from string, which gets its time from tick.usask.ca.
Dilaton got its time from time-nw.nist.gov, a time server located at
Microsoft, but was switched to string on Apr 15.
In August, string was switched to running ntp with a Garmin 18LVC gps
receiver delivering PPS signals to ntp. The accuracy of string then became
of the order of a microsecond.
In Nov 07, the bottom graphs were added. These give the measured offsets and
round trip delay times for string as the stratum 0 source from each of the
machines. The large (up to 1 sec) round trip times seem to be due to problems
with the switches installed in Physics (Cisco Gigabit switches), which seem to
insert latencies of up to 2 seconds in routing the ntp packets between the
various machines and string. monopole, charge, gauge, boson, dilaton, flory,
info, fluxon are all on the same set of switches, so the delays come from
single switches.
This is especially obvious in the week ending Feb 18. Some of the machines
have huge (10ppm) fluctuations in the rate, and at exactly the same time,
others (eg charge) are running in the .2 ppm range of fluctuations.
Ie, these fluctuations are not coming from the source ntp.ubc.ca. They seem
to be inherent in the way chrony is setting the rates.
Since the time between comparisons of the system clock vs the NTP server is of the order of 100-1000 sec (peer delay is .6ms typically), the noise rate in the case of the best system would correspond to a drift of less than a millisecond between comparisons.
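The arithmetic behind that estimate, as a small sketch. The function name is made up; the rates (.2 ppm and 10 ppm) and the 1000 sec interval are the figures quoted above.

```python
def drift_seconds(rate_error_ppm, interval_s):
    """Time error accumulated over interval_s at a given rate error.

    1 ppm is one microsecond of error per second of elapsed time.
    """
    return rate_error_ppm * 1e-6 * interval_s


# Longest comparison interval mentioned above: 1000 sec.
best_case = drift_seconds(0.2, 1000.0)    # the .2 ppm machines
worst_case = drift_seconds(10.0, 1000.0)  # the 10 ppm machines
```

At .2 ppm the accumulated error over 1000 sec is about 0.2 ms, comfortably under a millisecond, while a 10 ppm fluctuation over the same interval would amount to about 10 ms.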
Notes:
dilaton   Core 2 Duo 2.8GHz Intel, 3GB RAM, Gb ethernet
gauge     One 750MHz Intel Pentium III Processor, 256M RAM, 100Mb ethernet
monopole  3GHz Dual core, 4GB RAM, 1Gb ethernet
charge    Dual core 3GHz Intel Pentium 4 Processor, 3GB RAM, 1Gb ethernet
orbit     One 935MHz Intel Pentium III Processor, 256M RAM, 100Mb
string    One 1.6GHz Intel Pentium 4 Processor, 512M RAM, 100Mb
fluxon    One 2.67GHz Intel Pentium 4 Processor, 0.99GB RAM, Gb ethernet
boson     Two 2.8GHz Intel Pentium 4 Processors, 0.98GB RAM, 100Mb
info      Quad 2.7GHz Intel Pentium i5 Processors, 8GB RAM, 1Gb ethernet
flory     Two 3GHz Intel Pentium D Processors, 1GB RAM, 100Mb
These rate fluctuations do not represent the actual clock accuracy (in general chrony keeps the clocks to within a millisecond or less) but do represent the stability of the onboard system clock (driven from the bus frequency) and to some extent of the real time clock. As chrony works, it measures the real time clock against the system clock, so an unstable system clock would produce an apparently unstable real time clock. In general the RTC seems to be more stable than is the system clock (the correlated fluctuations in the system clock and RTC would suggest that a fair amount of the RTC instability comes from the system clock rather than the RTC itself, although this may be belied by the fact that the rate fluctuations for the RTC and the rate fluctuations for the system clock are very different in scale).
To investigate whether the oscillations with about a 1.5 hour period in most of the chrony graphs are real fluctuations (eg caused by temperature fluctuations with a 1.5 hr scale) or are produced by the clock algorithm of chrony itself, I switched flory to ntp as the client instead on Jan 19 2007. What is striking is how long it took the ntp client to come into sync. The graph below plots the offsets and the rate set by ntp on startup. Note that there was no drift file, so ntp had to figure out what the drift rate of the clock was on its own. Before the change to ntp the clock was within less than a ms of the correct ntp time and running at the correct rate (having been set by chrony). ntp caused the clock to rapidly go to a -20ms offset, overcorrect to a 60ms offset, and then take hours to finally get back to about +5ms. Overnight, it went to a poll number of 10 (2^10 sec), and the switch between flory and string seems to have occasionally introduced 5-10ms delays (about 20% of the time). However, ntp over the next 8 hours never managed to get the offset below 3-6ms. At daynumber 19.815 I restarted ntp with maxpoll 7 (which was the same as I had run chrony at) and the offset now rapidly settled down to about 100usec, and began to have both positive and negative excursions. Ie, ntp seems to have a really hard time dealing with transient effects (like being started without a drift file). Chrony on the other hand, even with a one second initial offset, settles down to a locked, minimal offset condition in less than an hour.
If we compare the standard deviation of the offset produced by chrony over the week Jan 13.5-18.3 UTC with the standard deviation from ntp over Jan 20.0-Jan 21.88, I get
On the other hand, the mean rate and rate fluctuations (standard deviation) are
The mean time between measurements is 126.5s for chrony and 123.9s for ntp, both at maxpoll 7. (If they really used the max accurately, both would be 128 sec between ntp queries.) Ie, the better offset control by chrony does not come at the expense of more measurements by chrony.
The one place that ntp seems to do significantly better than chrony is the
round trip time. For ntp the round trip time is 159usec with a standard
deviation of 6.6usec, while for chrony the average is 178usec with a
standard deviation of 28usec. Ie the standard deviation is four times
worse.
(Jan 24) I have discovered that ntp does even worse than stated, in both
the round trip time and the offset variance. NTP has a clock
filter algorithm which takes the shortest round trip of the last eight
samples and reports that as the round trip, and also uses the offset
associated with that shortest round trip as the offset. Thus there are many
repetitions (usually of the order of 6 in a row). chrony's measurements
were the actual measured offsets and round trips on each of the measurement
events.
(Jan 24/08) The ntp algorithm only submits the measurement for use in the
clock control algorithm if the most recent measurement is also the best of
the last 8. (Actually the selection criterion is slightly more complicated,
as the "round trip" used in the algorithm is equal to the actual round trip
plus the event number (most recent is 0) times the freq error (15PPM) times
the time since the last sample, or the freq times the last sample.)
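A minimal sketch of the selection rule described here, written only from the description in these notes and not taken from the ntpd source. The class name, method names and sample numbers are made up. By default it implements the simple rule (use a measurement only if it has the shortest round trip of the last eight). The optional penalty_ppm argument adds an age-dependent penalty to older samples, which for evenly spaced polls is the same as the "event number times freq error times time between samples" term; whether ntpd applies it in exactly this form is an assumption.

```python
from collections import deque


class ClockFilter:
    """Keep the last 8 samples; accept the newest only if it is the best."""

    def __init__(self, window=8, penalty_ppm=0.0):
        self.samples = deque(maxlen=window)  # (time, offset, delay), oldest first
        self.penalty = penalty_ppm * 1e-6

    def _score(self, sample, now):
        t, _offset, delay = sample
        # Round trip plus an optional penalty that grows with sample age.
        return delay + self.penalty * (now - t)

    def best(self, now):
        """Sample with the lowest score (what would be reported)."""
        return min(self.samples, key=lambda s: self._score(s, now))

    def add(self, t, offset, delay):
        """Store a sample; return its offset if it would be used, else None."""
        sample = (t, offset, delay)
        self.samples.append(sample)
        return offset if self.best(t) is sample else None


# Hypothetical measurements, 128 sec apart.
f = ClockFilter()
first = f.add(0.0, 1.0e-4, 160e-6)    # only sample so far, so it is used
second = f.add(128.0, 2.0e-4, 200e-6)  # slower round trip than the first
third = f.add(256.0, 3.0e-4, 150e-6)   # new shortest round trip
```

With the simple rule the second measurement is discarded and the first one keeps being reported, which is how runs of repeated values arise in the ntp statistics.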
Further investigation seems to indicate that the smaller round trip scatter
of ntp is primarily due to the higher priority that ntp runs at (ntpd sets
its priority to -12 while chrony was running at the default of 0). If I
eliminate all round trips with a delay of greater than .2ms, the standard
deviation for chrony drops to 7usec, the same as ntpd. Of course getting rid
of those items for the statistics does not eliminate their effect on the
offset noise and on the clock discipline. I am now running the chrony
processes with a nice value of -12.
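The effect of dropping the slow round trips from the statistics can be illustrated like this. The delay values are invented for the example and are not the measured data; only the .2ms cutoff comes from the text.

```python
import statistics


def delay_stats(delays_s, cutoff_s=None):
    """Mean, sample standard deviation and count of the round trips kept.

    Round trips longer than cutoff_s are dropped when a cutoff is given.
    """
    kept = [d for d in delays_s if cutoff_s is None or d <= cutoff_s]
    return statistics.mean(kept), statistics.stdev(kept), len(kept)


# Hypothetical round trips in seconds, with one delayed packet.
delays = [170e-6, 175e-6, 180e-6, 172e-6, 300e-6, 178e-6]

mean_all, sd_all, n_all = delay_stats(delays)
mean_cut, sd_cut, n_cut = delay_stats(delays, cutoff_s=0.2e-3)
```

A single delayed packet dominates the standard deviation of the full set, so the cut changes the statistics a great deal even though it removes only one point.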
After ntp 4.2.4 was started, the mean round trip dropped to 150usec and the
standard deviation to 5usec. This makes the better offset control by chrony even more impressive.
However the measurements for ntp were done on the weekend, while those for
chrony were done during the week. (Also see above about the niceness.)
The reduced small scale noise in ntp is probably due to the fact that ntp throws away about 7/8 of the data points in the clock filter, and to the exponential feedback form of the discipline. chrony uses all data points which satisfy minimal conditions, and uses a fit to the last n points, where n in this case is around 64 points. Also chrony drops its poll interval from maxpoll more readily than does ntp, so that the average poll time is about 70 sec for chrony and 120 sec for ntp, with a maxpoll time of 128s.
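As a rough picture of what "a fit to the last n points" means, here is a sketch of a sliding-window straight-line fit that gives the current offset and rate. It is not chrony's actual code (chrony's regression is more elaborate and varies the number of points it uses); the class name and the test data are made up.

```python
from collections import deque


class RegressionTracker:
    """Least-squares line through the last max_points (time, offset) pairs."""

    def __init__(self, max_points=64):
        self.points = deque(maxlen=max_points)

    def add(self, t, offset):
        self.points.append((t, offset))

    def estimate(self):
        """Return (offset at the newest sample time, rate in sec/sec).

        Returns None if there are too few points to fit a line.
        """
        n = len(self.points)
        if n < 2:
            return None
        t_ref = self.points[-1][0]
        xs = [t - t_ref for t, _ in self.points]  # times relative to newest
        ys = [o for _, o in self.points]
        x_mean = sum(xs) / n
        y_mean = sum(ys) / n
        sxx = sum((x - x_mean) ** 2 for x in xs)
        if sxx == 0.0:
            return None
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        rate = sxy / sxx
        offset_now = y_mean - rate * x_mean  # value of the line at x = 0
        return offset_now, rate


# Hypothetical clock: 1 ms offset at t=0, running 2 ppm fast, polled every 128s.
tracker = RegressionTracker()
for k in range(10):
    t = 128.0 * k
    tracker.add(t, 1.0e-3 + 2.0e-6 * t)

offset_now, rate = tracker.estimate()
```

Every accepted point contributes to the fit, in contrast to a filter that keeps only about one sample in eight.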
The response of ntp to a change in rate is slow. Linux in the more recent kernels has had a highly inconsistent calibration of the clocks, so that the drift rate changes by about 30-50PPM on each reboot. ntp must thus respond to this change. Here is a plot of the response of ntp to one such change.