Clock Distribution with Gradient TRIX

Christoph Lenzen; Shreyas Srinivas

TL;DR

This work tackles gradient clock synchronization (GCS) in large, low-degree networks. It introduces a self-stabilizing GCS algorithm for a grid-like directed graph that needs node in- and out-degrees of only $3$ and tolerates a single faulty in-neighbor. The Gradient TRIX scheme discretizes the gradient clock mechanism across the layers of the grid and enforces robust timing through median-based corrections and the Slow, Fast, and Jump conditions, while handling slowly varying link delays and clock drift. The key contributions are a fault-resilient architecture of near-minimal degree that achieves local skew $\Theta(\log D)$ with probability $1-o(1)$ under sparse, independently occurring faults, and a self-stabilizing pulse-forwarding scheme that recovers from transient faults in $O(\sqrt{n})$ pulses. These results advance clock distribution for large synchronous Systems-on-Chip (SoCs) by reducing replication overhead and enabling scalable, robust timing in hardware.

Abstract

Gradient clock synchronization (GCS) algorithms minimize the worst-case clock offset between the nodes in a distributed network of diameter $D$ and size $n$. They achieve optimal offsets of $\Theta(\log D)$ locally, i.e., between adjacent nodes, as shown by Lenzen et al., and $\Theta(D)$ globally, as shown by Biaz and Welch. As demonstrated in the work of Bund et al., this is a highly promising approach for improved clocking schemes for large-scale synchronous Systems-on-Chip (SoCs). Unfortunately, in large systems, faults hinder their practical use. The state-of-the-art fault-tolerant GCS algorithm, presented by Bund et al., has a drawback that is fatal in this setting: it relies on node and edge replication. For $f=1$, this translates to at least $16$-fold edge replication and high-degree nodes, far from the optimum of $2f+1=3$ for tolerating up to $f$ faulty neighbors. In this work, we present a self-stabilizing GCS algorithm for a grid-like directed graph with optimal node in- and out-degrees of $3$ that tolerates $1$ faulty in-neighbor. If nodes fail with independent probability $p\in o(n^{-1/2})$, it achieves an asymptotically optimal local skew of $\Theta(\log D)$ with probability $1-o(1)$; this holds under general worst-case assumptions on link delay and clock speed variations, provided they change slowly relative to the speed of the system. The failure probability is the largest possible ensuring that, with probability $1-o(1)$, at most one in-neighbor fails for each node. As modern hardware is clocked at gigahertz speeds and the algorithm can simultaneously sustain a constant number of arbitrary changes due to faults in each clock cycle, this results in sufficient robustness to dramatically increase the size of reliable synchronously clocked SoCs.
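The role of the threshold $p\in o(n^{-1/2})$ can be checked with a simple union bound: a node with in-degree $3$ has two or more faulty in-neighbors with probability roughly $3p^2$, so summing over $n$ nodes gives about $3np^2$, which vanishes exactly when $p$ shrinks faster than $n^{-1/2}$. The sketch below is not taken from the paper; the function name and the default in-degree of $3$ are illustrative assumptions, and it only illustrates the upper-bound direction of the claim.

```python
from math import comb


def union_bound_two_faulty_in_neighbors(n: int, p: float, in_degree: int = 3) -> float:
    """Union bound on the probability that SOME node has >= 2 faulty in-neighbors.

    Assumes (for illustration) that every node has `in_degree` in-neighbors
    and that each in-neighbor fails independently with probability p.
    The returned value is an upper bound and may exceed 1.
    """
    # Probability that a single node sees at least two faulty in-neighbors.
    per_node = sum(
        comb(in_degree, k) * p**k * (1 - p) ** (in_degree - k)
        for k in range(2, in_degree + 1)
    )
    # Union bound over all n nodes.
    return n * per_node


if __name__ == "__main__":
    for n in (10**4, 10**6, 10**8):
        at_threshold = union_bound_two_faulty_in_neighbors(n, n ** -0.5)
        below_threshold = union_bound_two_faulty_in_neighbors(n, n ** -0.6)
        print(n, at_threshold, below_threshold)
```

With $p=n^{-1/2}$ the bound stays near the constant $3$ and gives no guarantee, whereas with $p=n^{-0.6}$ it shrinks as $n$ grows.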

Paper Structure (14 sections, 49 theorems, 110 equations, 5 figures, 1 table, 4 algorithms)


Introduction
Modeling
Algorithm
Simplified Pulse Forwarding Algorithm
Analysis
The Slow, Fast, and Jump Conditions
Bounding Psi in the Absence of Faults
Bounding Skews in the Presence of Faults
Obtaining the Final Skew Bounds
Generating Synchronized Inputs
Full Pulse Forwarding Algorithm
Self-Stabilization
Algorithm Modification:
Basic Statements

Key Result

Theorem 1.1

If there are no faults, then $\mathcal{L}_{\ell}\le 4\kappa (2+\log D)$ for all $\ell\in \mathbb{N}$.
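To get a feel for how slowly this bound grows, it can be evaluated numerically. The helper below is only a sketch: its name is hypothetical, it assumes the logarithm is base $2$ (the statement above does not fix the base), and $\kappa$ is taken as a given parameter in an arbitrary time unit.

```python
import math


def fault_free_local_skew_bound(kappa: float, diameter: int) -> float:
    """Evaluate 4 * kappa * (2 + log D) from Theorem 1.1.

    Assumption: log is taken to base 2; kappa is in an arbitrary time unit.
    """
    if kappa < 0 or diameter < 1:
        raise ValueError("kappa must be non-negative and diameter at least 1")
    return 4.0 * kappa * (2.0 + math.log2(diameter))


if __name__ == "__main__":
    for d in (16, 1024, 65536):
        print(d, fault_free_local_skew_bound(1.0, d))
```

Under the base-2 assumption, doubling the diameter adds only $4\kappa$ to the bound, which is the logarithmic behavior the theorem expresses.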

Figures (5)

Figure 1: TRIX [lenzen20trix] (left) and HEX [Dolev2016a] (right) grids. TRIX uses the naive pulse forwarding scheme of waiting for the second copy of each pulse before forwarding it. We see how the TRIX grid can accumulate a local skew of $\Theta(uD)$ in layer $D$. In the HEX grid, each node waits for two copies of a pulse from in-neighbors. However, $2$ of the $4$ in-neighbors are on the same layer, causing a skew of $d$ if a neighbor on the preceding layer crashes.
Figure 2: Base graph $H$ used in this work. Rather than using a cycle, which would result in a TRIX grid, we replicate the end nodes of a line to ensure a minimum degree of $2$. Alternatively, one could use a line and exploit that the probability that one of the $O(\sqrt{n})$ boundary nodes fails is $o(1)$.
Figure 3: Layer structure of $G$ resulting from our choice of $H$. Most nodes have in- and out-degree $3$, some $4$.
Figure 4: Slow condition (left) and fast condition (right). $\operatorname{\text{SC}}(s)$ is tailored to ensuring that $\max_{w\in V}\{\psi_{v,w}^s(\ell)\}$ (the length of the green arrow) cannot grow quickly. Nodes $w$ with $\mathcal{C}_{w,\ell}\le 0$ ($\operatorname{\text{SC-3}}$ holds) cannot apply a correction pushing them below the red line. If $\mathcal{C}_{w,\ell}>0$, then both $\operatorname{\text{SC-1}}$ and $\operatorname{\text{SC-2}}$ will ensure that there is a neighbor $x$ of $w$ such that the offset of $t_{w,\ell-1}-\mathcal{C}_{w,\ell}/\vartheta$ to the black line does not exceed the one of $t_{x,\ell-1}$. In other words, $\operatorname{\text{SC}}$ ensures that the blue arrows indicating $\mathcal{C}_{w,\ell}/\vartheta$ do not reach below the red line. This means that any increase of $\max_{w\in V}\{\psi_{v,w}^s(\ell)\}$ is caused by delay and clock speed variation, which in turn is bounded by $\kappa/2$ per layer. Similarly, $\operatorname{\textbf{FC}}(s)$ is tailored to ensuring that $\max_{v\in V}\{\xi_{v,w}^s(\ell)\}$ (the length of the green arrow), if positive, decreases by at least $\kappa/2$. To ensure this, $\mathcal{C}_{w,\ell}$ (indicated by blue arrows) must be large enough to reach below the red line. This is achieved by $\operatorname{\textbf{FC}}(s)$ having an additional "slack" term of $\kappa$, which overcomes the "loss" of $\kappa/2$ due to uncertainty.
Figure 5: The left side shows how skews increase without $\operatorname{\text{JC}}$. While $\operatorname{\text{SC}}(0)$ prevents $(v,\ell)$ from speeding up its pulse by more than the equivalent of $(v,\ell-1)$ matching the earliest pulse of any $(w,\ell-1)$, $\{v,w\}\in E$, $\operatorname{\textbf{FC}}$ permits a node $(v,\ell)$ with slow $(v,\ell-1)$ to "overshoot," i.e., $\mathcal{C}_{v,\ell}$ (shown as a blue arrow) gets large. This results in an amplifying oscillatory behavior. The right side shows the same scenario with $\operatorname{\text{JC}}$ in effect. $\operatorname{\text{JC}}$ forces the corrections to stop $\kappa$ before the earliest or latest neighbor, respectively, resulting in a dampened oscillation.
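The naive forwarding rule mentioned in the caption of Figure 1, waiting for the second copy of a pulse, amounts to taking the median of three arrival times. The sketch below illustrates why this tolerates one faulty in-neighbor: whatever the faulty copy does, the second arrival falls between the two correct ones. The function name is hypothetical, and the corrections and Slow, Fast, and Jump conditions of Gradient TRIX are deliberately not modeled here.

```python
def second_copy_time(arrivals: list[float]) -> float:
    """Return the time at which the second of three pulse copies arrives.

    `arrivals` holds one arrival time per in-neighbor; a crashed neighbor
    that never sends can be modeled as float("inf").
    """
    if len(arrivals) != 3:
        raise ValueError("this sketch assumes exactly three in-neighbors")
    return sorted(arrivals)[1]


if __name__ == "__main__":
    early, late = 10.0, 10.4  # arrival times from two correct in-neighbors
    # The third in-neighbor is faulty and may send at any time, or never.
    for faulty in (float("-inf"), 3.0, 10.2, 99.0, float("inf")):
        t = second_copy_time([early, faulty, late])
        print(faulty, t, early <= t <= late)
```

As the caption of Figure 1 notes, this rule alone can still accumulate a local skew of $\Theta(uD)$, which is what the gradient-based corrections are meant to address.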

Theorems & Definitions (104)

Theorem 1.1
Theorem 1.2
Theorem 1.3
Theorem 1.4
Corollary 1.4
Theorem 1.5
Definition 4.1: Potential Functions
proof
Definition 4.3: Slow Condition
Definition 4.4: Fast Condition
...and 94 more
