Experience and thoughts about Orphan Mode and configuration (6.2.)

Orphan Mode as described and configured at https://support.ntp.org/bin/view/Support/OrphanMode can lead to some strange behavior with 3 or more hosts in the Orphan Mode group. It is presumed for this discussion that each host in the Orphan Mode group has at least one refclock or external server not in the Orphan Mode group, and that any additional clients use Orphan Mode group hosts as servers.

For group hosts on a LAN and external servers reached over the Internet (e.g. pool servers), the LAN group hosts will typically have much lower delay and jitter than the Internet hosts. When the LAN hosts are synchronized, this discrepancy in delay and jitter will tend to cause the NTP clock selection algorithms to mark the Internet hosts as (at best) candidates or outliers, and LAN host peers will be preferred. N.B. This occurs even if all Internet hosts remain reachable in normal operation.

If there are only 2 hosts in the Orphan Mode group, there is no problem, as the two hosts cannot lock to each other; the algorithms prevent such a small loop. However, with 3 or more hosts, a loop can be formed, e.g. A->B->C->A, and those hosts will drift off as an Orphan Group, with each host's stratum gradually increasing until the orphan mode stratum is reached. That group will then continue to drift at the rate determined by the group leader's frequency. Again, this happens even in normal operation due to the discrepancy in LAN vs. Internet delay and jitter.
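The A->B->C->A loop can be illustrated with a minimal sketch. It assumes a table recording which source each group host has currently selected, and reads the arrows as "selects as its source". The host names, the `pool-1` source and the function `find_sync_loop` are hypothetical; this is not how ntpd represents its state.

```python
def find_sync_loop(sync_source, start):
    """Follow each host's currently selected source, beginning at `start`.

    Returns the hosts forming a loop, or None if the chain ends at a
    source that is not a group host (an external server or refclock).
    """
    path = []
    seen = {}
    host = start
    while host in sync_source:
        if host in seen:
            # We have come back to a host already on the path: a loop.
            return path[seen[host]:]
        seen[host] = len(path)
        path.append(host)
        host = sync_source[host]
    return None


# Hypothetical selections: every group host has picked a LAN peer.
looped = {"A": "B", "B": "C", "C": "A"}
# Same group, but C has selected an external server instead.
healthy = {"A": "B", "B": "C", "C": "pool-1"}

print(find_sync_loop(looped, "A"))
print(find_sync_loop(healthy, "A"))
```

In the first table no chain ever reaches an external source, which is the situation in which the group drifts off on its own.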

If there are many more Internet sources than Orphan Mode group hosts, then once the group has drifted sufficiently far the selection algorithm will mark the group hosts as outliers and revert to locking to the Internet hosts (either stepping or slewing, depending on the accumulated error). The result is that the Orphan Group and any hosts using its members as servers will drift in and out of synchronization with ntp time (unless there are insufficient Internet hosts, in which case the Orphan Group may continue to drift indefinitely!).

There is no provision in ntpd to detect and avoid loops in a mesh of 3 or more hosts. As with ntptrace, it might not be possible to detect such a loop if some hosts do not make refid information available. Even if the information is available, the number of possible loops to be checked grows very quickly as the number of hosts in a mesh increases, so such checking might be impractical.
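To give a rough sense of that growth, the sketch below counts the distinct directed loops of 3 or more hosts that could exist in a full mesh. It assumes every ordering of every subset of hosts is a possible loop, and excludes 2-host loops because the selection algorithms already prevent them. The function name `count_possible_loops` is hypothetical.

```python
from math import comb, factorial


def count_possible_loops(n_hosts):
    """Count directed loops of length >= 3 among n fully meshed hosts.

    For each loop length k there are comb(n, k) ways to pick the hosts
    and (k - 1)! distinct directed orderings of them around the loop.
    """
    total = 0
    for k in range(3, n_hosts + 1):
        total += comb(n_hosts, k) * factorial(k - 1)
    return total


for n in range(3, 9):
    print(n, count_possible_loops(n))
```

Under these assumptions the count grows faster than exponentially with the number of hosts, which is consistent with the concern that exhaustive checking would be impractical.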

Loops can be avoided if the Orphan Mode group hosts are connected (for ntp peering purposes) in a star configuration rather than a full mesh. There is then no possibility of a loop, and the clock selection algorithms continue to prevent any pair of hosts locking to each other. So long as some Internet server provides acceptable time per the ntp clock selection algorithms, the entire group should remain synchronized to ntp time. With the star configuration, the orphan mode stratum should be N+3 rather than N+2 as in a mesh configuration. In normal operation, the host at the hub of the star peer arrangement will typically lock to one of its (LAN) peers, due to the lower delay and jitter factors mentioned above.
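A minimal sketch of what the star arrangement might look like in configuration terms: the script below prints the `peer` and `tos orphan` lines each group host could carry in its ntp.conf. The host names, the function `star_peer_lines` and the stratum value 7 are hypothetical placeholders; the orphan stratum is supplied by the caller (N+3 per the text above), and the `server` or refclock lines for each host's external sources are deliberately left out.

```python
def star_peer_lines(hosts, hub, orphan_stratum):
    """Return {host: [ntp.conf lines]} for a star peering layout.

    The hub peers with every other group host; every other host peers
    only with the hub. External sources are not included here.
    """
    if hub not in hosts:
        raise ValueError("hub must be one of the group hosts")
    config = {}
    for host in hosts:
        lines = [f"tos orphan {orphan_stratum}"]
        if host == hub:
            lines += [f"peer {other}" for other in hosts if other != hub]
        else:
            lines.append(f"peer {hub}")
        config[host] = lines
    return config


# Hypothetical group of four LAN hosts with ntp-a as the hub.
group = ["ntp-a", "ntp-b", "ntp-c", "ntp-d"]
config = star_peer_lines(group, hub="ntp-a", orphan_stratum=7)

for host, lines in config.items():
    print(f"# {host}")
    print("\n".join(lines))
```

Because no two non-hub hosts peer with each other, the only peer associations are hub-to-spoke pairs, which the selection algorithms already keep from locking to each other.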

Possible failure modes: If all external sources become unreachable (e.g. loss of Internet connectivity and no local refclocks), the Orphan mode host at the hub of the star configuration might lock to one of the others (which are all peered with it), or it might become the orphan mode leader (depending on IP addresses and which host last lost connectivity with external servers). All Orphan mode group hosts will have a stratum within 1 of the hub's. Clients using Orphan mode servers will generally lock to the one with the lowest stratum. If the host at the hub is down (or ntp is not running on it), then unlike in the mesh arrangement the remaining group hosts have no group peer left, since each of them peers only with the hub, and the group disintegrates. Clients can continue to use the remaining "group" hosts (which are presumed to still be synchronized with external servers or refclocks). In the unlikely case of loss of all references (external servers and refclocks) and loss of the hub host, there is no Orphan mode group; the individual hosts may drift (until the hub or external sources are restored), and clients will use the usual clock selection algorithm to select a reference.

Neither situation is ideal; the mesh arrangement (even with full connectivity) results in group and client time drifting in and out of sync with ntp time, and the hub arrangement has an additional possibility of failure if all external and refclock connectivity is lost and the hub host is down (if prolonged downtime is expected, the group could be reconfigured).

For a large number of orphan group hosts, a binary tree arrangement of peers could be used; failure of any one host would then split the remaining hosts into at most 3 groups (a ternary tree might split into as many as 4 groups, etc.). The orphan mode stratum would be based on N plus the maximum path length between hosts (typically twice the tree depth). In normal operation, interior nodes in the peer connections would most likely be locked to one of their peers due to delay and jitter issues.

A modest number of hosts could be connected as a linear sequence of peers. Failure of any one host would split the remaining hosts into at most two groups. Orphan mode stratum would be N plus the number of hosts in the group. In normal operation, group hosts not at either end of the line would likely synchronize to other group (LAN) host peers.
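The trade-offs between the star, tree, and line arrangements can be checked on small examples. The sketch below treats each peering layout as an undirected graph and computes two figures discussed above: the maximum path length between hosts, and the largest number of groups left after any single host fails. The example layouts (5-host star, 5-host line, 7-host binary tree) and all function names are hypothetical.

```python
from collections import deque


def build(edges):
    """Build an undirected adjacency map from (host, host) peer pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def reachable(adj, start, removed=None):
    """Breadth-first search; returns {host: hop count} from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt != removed and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def max_path_length(adj):
    """Longest shortest path (in peer hops) between any two hosts."""
    return max(max(reachable(adj, h).values()) for h in adj)


def worst_split(adj):
    """Largest number of groups left after any single host fails."""
    worst = 0
    for failed in adj:
        seen, groups = set(), 0
        for host in adj:
            if host != failed and host not in seen:
                seen |= set(reachable(adj, host, removed=failed))
                groups += 1
        worst = max(worst, groups)
    return worst


star = build([(0, i) for i in range(1, 5)])              # hub 0, 4 spokes
line = build([(i, i + 1) for i in range(4)])             # 5 hosts in a row
tree = build([(i, (i - 1) // 2) for i in range(1, 7)])   # 7-host binary tree

for name, adj in (("star", star), ("line", line), ("tree", tree)):
    print(name, max_path_length(adj), worst_split(adj))
```

For these examples the line splits into at most two groups and the binary tree into at most three, while the star loses every peer association when its hub fails.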

In summary:
  1. Use of precisely two hosts in an orphan mode group should pose few special problems. If one set of the external sources are themselves out of synchronization with ntp time, there may be some clock hopping.
  2. The recommended mesh configuration can result in drift in and out of synchronization with ntp time with 3 or more hosts, even with normal connectivity, due to peering loops and delay/jitter discrepancies between LAN and external servers. In some cases, the group may drift away from synchronization indefinitely.
  3. A star/hub peering configuration can avoid the loops that cause loss of synchronization (with a mesh) in normal operation. However it is susceptible to disintegration of the group unless the hub host is reliable.
  4. A tree arrangement for peering avoids a single point of failure for group disintegration, but it is only suitable for large groups, and the orphan mode stratum may be rather high.
  5. Linear peering also avoids a single point of failure for group disintegration, but it is impractical for large groups due to the high orphan mode stratum. Note that for precisely 3 hosts, the star, tree, and line configurations are identical.

History: With local GPS refclocks and pool servers as backup, 3-4 hosts in a mesh (using separate pool groups) worked fine, as refclock delay is zero and jitter is low; each host locked to its refclock. During GPS week 2024 (late October 2018), buggy GPS firmware (Garmin 17xHVS) started reporting a date in year xx99 (correctable via the refclock driver, using GPS week, seconds, and leap seconds to compute the correct date) and exhibited unusably high PPS jitter (not correctable). After the GPS refclocks were disconnected pending a firmware update to resolve the problems, the hosts relied on the Internet pool servers, and the orphan group loops were identified as the root cause of time drifting in and out of sync with ntp time. Reconfiguring peering as a star or line (4 hosts is too few for a tree) resolved the problem.

-- BruceLilly - 2018-11-02
Topic revision: r4 - 04 Nov 2018, BruceLilly