Practical Rateless Set Reconciliation

Lei Yang; Yossi Gilad; Mohammad Alizadeh

TL;DR

The paper addresses the inefficiency of existing set reconciliation methods by introducing Rateless Invertible Bloom Lookup Tables (Rateless IBLT), a rateless, universal encoder that streams coded symbols encoding the set difference without requiring prior knowledge of the difference size $d$. With a carefully designed mapping probability $\rho(i)=\frac{1}{1+\alpha i}$ (with $\alpha=0.5$) and a closed-form sampling method, a peeling decoder recovers the difference with an average communication overhead that converges to $1.35$ coded symbols per differing item as $d$ grows, while computation stays low ($O(\ell\log d)$ per item, where $\ell$ is the item length). The authors provide a density-evolution analysis, Monte Carlo validation, a compact Go implementation, and an evaluation against state-of-the-art schemes, showing reductions in communication and computation for both large and small differences. They also explore irregular Rateless IBLTs, which lower the communication overhead further at some cost in speed. The main practical result is Ethereum state synchronization, with potential applicability to other blockchains and large-scale replicated services that need scalable, low-latency reconciliation across peers.
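To make the overhead figure above concrete, here is a minimal Python sketch that streams coded symbols for a random difference and counts how many a peeling decoder needs. Several parts are assumptions of this sketch and not the paper's method: the mapping decision is a naive per-index hash test rather than the closed-form sampling method (so it costs $O(m)$ per item), the hash is BLAKE2b, the symbols encode the difference directly instead of being computed by subtracting two parties' streams, and all helper names are hypothetical.

```python
import hashlib
import random

ALPHA = 0.5  # value quoted in the TL;DR


def rho(i, alpha=ALPHA):
    """Mapping probability rho(i) = 1 / (1 + alpha * i)."""
    return 1.0 / (1.0 + alpha * i)


def h64(tag, x, i=0):
    """64-bit hash of (tag, x, i); the hash choice is an assumption of this sketch."""
    data = f"{tag}|{x}|{i}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def maps_to(x, i):
    """Item x maps to coded symbol i with probability rho(i), decided by a hash."""
    return h64("map", x, i) < rho(i) * 2**64


def xor_in(symbol, x, sign):
    """Add (sign=+1) or remove (sign=-1) item x from a [sum, checksum, count] symbol."""
    symbol[0] ^= x
    symbol[1] ^= h64("chk", x)
    symbol[2] += sign


def decode_stream(diff, max_symbols=10_000):
    """Consume coded symbols of `diff` one by one until peeling recovers everything."""
    symbols, recovered = [], set()
    for i in range(max_symbols):
        sym = [0, 0, 0]
        for x in diff:
            if maps_to(x, i):
                xor_in(sym, x, +1)
        for x in recovered:  # items already known are removed from the new symbol
            if maps_to(x, i):
                xor_in(sym, x, -1)
        symbols.append(sym)
        progress = True
        while progress:
            progress = False
            for s, c, n in symbols:
                if n == 1 and h64("chk", s) == c:  # pure symbol: exactly one item
                    recovered.add(s)
                    for k, other in enumerate(symbols):
                        if maps_to(s, k):
                            xor_in(other, s, -1)
                    progress = True
        # rho(0) = 1, so every item maps to symbol 0; it is empty only when all are out.
        if symbols[0] == [0, 0, 0]:
            return recovered, i + 1
    raise RuntimeError("not decoded within max_symbols")


rng = random.Random(1)
diff = {rng.getrandbits(64) | 1 for _ in range(50)}  # 50 random nonzero items
recovered, used = decode_stream(diff)
print(f"d={len(diff)} symbols={used} overhead={used / len(diff):.2f}")
```

The printed ratio is the overhead for one small run. The $1.35$ figure is the limit as $d$ grows, so a single run at a small $d$ should not be expected to match it.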

Abstract

Set reconciliation, where two parties hold fixed-length bit strings and run a protocol to learn the strings they are missing from each other, is a fundamental task in many distributed systems. We present Rateless Invertible Bloom Lookup Tables (Rateless IBLT), the first set reconciliation protocol, to the best of our knowledge, that achieves low computation cost and near-optimal communication cost across a wide range of scenarios: set differences of one to millions, bit strings of a few bytes to megabytes, and workloads injected by potential adversaries. Rateless IBLT is based on a novel encoder that incrementally encodes the set difference into an infinite stream of coded symbols, resembling rateless error-correcting codes. We compare Rateless IBLT with state-of-the-art set reconciliation schemes and demonstrate significant improvements. Rateless IBLT achieves 3--4x lower communication cost than non-rateless schemes with similar computation cost, and 2--2000x lower computation cost than schemes with similar communication cost. We show the real-world benefits of Rateless IBLT by applying it to synchronize the state of the Ethereum blockchain, and demonstrate 5.6x lower end-to-end completion time and 4.4x lower communication cost compared to the system used in production.
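The following sketch illustrates two properties behind the "infinite stream of coded symbols" idea: a prefix of the stream never changes when more symbols are produced, and combining Alice's and Bob's symbols index by index cancels their shared items. It is an illustration under assumptions, not the authors' implementation: the hash (BLAKE2b), the per-index hash test for the mapping, and the names `coded_stream` and `subtract` are all hypothetical.

```python
import hashlib
from itertools import count, islice

ALPHA = 0.5


def h64(tag, x, i=0):
    """64-bit hash of (tag, x, i); an assumed stand-in for the real hash."""
    data = f"{tag}|{x}|{i}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def maps_to(x, i):
    """Item x maps to symbol i with probability 1 / (1 + ALPHA * i)."""
    return h64("map", x, i) < 2**64 / (1.0 + ALPHA * i)


def coded_stream(items):
    """Yield (sum, checksum, count) for index 0, 1, 2, ... without end."""
    for i in count():
        s = c = n = 0
        for x in items:
            if maps_to(x, i):
                s ^= x
                c ^= h64("chk", x)
                n += 1
        yield (s, c, n)


def subtract(a, b):
    """Combine two coded symbols with the same index: XOR the fields, subtract counts."""
    return (a[0] ^ b[0], a[1] ^ b[1], a[2] - b[2])


alice = {11, 22, 33, 44}
bob = {22, 33, 55}
m = 8

sa = list(islice(coded_stream(alice), m))
sb = list(islice(coded_stream(bob), m))
delta = [subtract(a, b) for a, b in zip(sa, sb)]

# Symbols of the symmetric difference {11, 44, 55}, encoded directly for comparison.
direct = list(islice(coded_stream(alice ^ bob), m))

print("prefix unchanged:", list(islice(coded_stream(alice), 4)) == sa[:4])
print("sum/checksum match:", [d[:2] for d in delta] == [s[:2] for s in direct])
print("first combined symbol:", delta[0])
```

The first combined symbol has count $4 - 3 = 1$ even though three items (11, 44, 55) are still mixed into its sum, which is one reason a decoder that uses signed counts also needs the checksum to confirm that a symbol is pure.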

Paper Structure (21 sections, 9 theorems, 18 equations, 16 figures)

Introduction
Motivation and Related Work
Background
Design
Coded Symbol Sequence
Linearity & Universality
Decodability
Realizing the Mapping Probability
Resistance to Malicious Workload
Analysis
Monte Carlo Simulations
Implementation
Evaluation
Communication Cost
Computation Cost
...and 6 more sections

Key Result

lemma 1

For any $\epsilon > 0$, any mapping probability $\rho(i)$ such that $\rho(i) = \Omega\left(1/i^{1-\epsilon}\right)$, and any $\sigma > 0$, if there exists at least one pure coded symbol within the first $m$ coded symbols for a random set $S$ with probability $\sigma$, then $m = \omega(|S|)$.
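
As a quick check of how this lemma relates to the mapping probability quoted in the TL;DR (a worked remark added for illustration, not a step from the paper's proof): for $\alpha > 0$ and $i \ge 1$,

$$
\rho(i) = \frac{1}{1+\alpha i} \le \frac{1}{\alpha i}
\quad\Longrightarrow\quad
\rho(i)\, i^{1-\epsilon} \le \frac{i^{-\epsilon}}{\alpha} \xrightarrow[i\to\infty]{} 0
\quad \text{for every } \epsilon > 0 .
$$

So $\rho(i)=\frac{1}{1+\alpha i}$ is $\Theta(1/i)$ and is not $\Omega\left(1/i^{1-\epsilon}\right)$ for any $\epsilon > 0$. It therefore lies outside the class of mapping probabilities for which the lemma forces $m = \omega(|S|)$.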

Figures (16)

Figure 1: Example of constructing a regular IBLT for set $A$ with source symbols $x_0, x_1, x_2, x_3$. The IBLT has $m=6$ coded symbols: $a_0, a_1, a_2, a_3, a_4, a_5$. Each source symbol is mapped to $k=3$ coded symbols. Solid lines represent the mapping between source and coded symbols. For example, for $a_4$, $\mathtt{sum}=x_1\oplus x_3$, $\mathtt{checksum}=\mathtt{Hash}(x_1)\oplus \mathtt{Hash}(x_3)$, and $\mathtt{count}=2$. $\oplus$ is the bitwise exclusive-or operator.
Figure 2: Example of decoding the IBLT in Fig. 1 using peeling. Dark colors represent pure coded symbols at the beginning of each iteration, and source symbols recovered so far. Dashed edges are removed at the end of each iteration, by XOR-ing the source symbol (now recovered) and its hash on one end of the edge into the sum and checksum fields of the coded symbol on the other end.
Figure 3: Regular IBLTs and prefixes of Rateless IBLT for $5$ source symbols. Figs. a, c (left) have too few coded symbols and are undecodable. Figs. b, d (right) are decodable. Red edges are common across each row. Dark coded symbols in Figs. b, d are new or changed compared to their counterparts in Figs. a, c. Imagine that Alice sends $4$ coded symbols but Bob fails to decode. In regular IBLT, in order to enlarge $m$, she has to send all $7$ coded symbols since the existing $4$ symbols also changed. In Rateless IBLT, she only needs to send the $3$ new symbols. The existing $4$ symbols stay the same.
Figure 4: Relationship between the communication overhead $\eta^*$ and the parameter $\alpha$ in $\rho(i)$. "DE" shows results from the density evolution analysis which assumes the difference size goes to infinity. Points show results from Monte Carlo simulations for various finite difference sizes. Each point is the average over 100 runs.
Figure 5: Overhead of Rateless IBLTs at varying difference sizes $d$. We run 100 simulations for each data point and report the average. The shaded area shows the standard deviation. The dashed line shows $1.35$, the overhead predicted by density evolution.
...and 11 more figures
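
The construction and peeling described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the rule that picks the $k=3$ cells for each source symbol (`cells_for`), the BLAKE2b hash, and the integer stand-ins for $x_0,\dots,x_3$ are hypothetical, so the edges will not match the ones drawn in the figure, and with only $m=6$ cells decoding may or may not succeed.

```python
import hashlib


def h64(tag, x, salt=0):
    """64-bit hash of (tag, x, salt); an assumed stand-in for Hash in the captions."""
    data = f"{tag}|{x}|{salt}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def cells_for(x, m, k):
    """Pick k distinct cell indices in [0, m) for item x (hypothetical choice rule)."""
    idx, salt = [], 0
    while len(idx) < k:
        j = h64("cell", x, salt) % m
        if j not in idx:
            idx.append(j)
        salt += 1
    return idx


def build_iblt(items, m=6, k=3):
    """Each cell is [sum, checksum, count]; every item is XOR-ed into k cells."""
    table = [[0, 0, 0] for _ in range(m)]
    for x in items:
        for j in cells_for(x, m, k):
            table[j][0] ^= x
            table[j][1] ^= h64("chk", x)
            table[j][2] += 1
    return table


def peel(table, k=3):
    """Repeatedly find a pure cell, recover its item, and remove it from its cells."""
    m = len(table)
    recovered = set()
    progress = True
    while progress:
        progress = False
        for s, c, n in table:
            if n == 1 and h64("chk", s) == c:  # pure: exactly one item remains
                recovered.add(s)
                for j in cells_for(s, m, k):
                    table[j][0] ^= s
                    table[j][1] ^= h64("chk", s)
                    table[j][2] -= 1
                progress = True
    decoded = all(cell == [0, 0, 0] for cell in table)
    return recovered, decoded


items = {101, 202, 303, 404}  # stand-ins for x_0 .. x_3
table = build_iblt(items)
total = sum(cell[2] for cell in table)  # 4 items * 3 cells each
recovered, decoded = peel(table)
print("total count:", total, "decoded:", decoded, "recovered:", sorted(recovered))
```

If `decoded` is false, some cells are still non-empty and none of them is pure, which is the situation Figure 3 describes for a regular IBLT: the table has to be rebuilt with a larger $m$.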

Theorems & Definitions (9)

lemma 1
lemma 2
theorem 1
corollary 1
theorem 2
theorem 3
lemma 3: Restatement of Lemma 1
lemma 4: Restatement of Lemma 2
theorem 4: Restatement of Theorem \ref{detheorem}
