State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing

Qiongwen Xu; Sebastiano Miano; Xiangyu Gao; Tao Wang; Adithya Murugadass; Songyuan Zhang; Anirudh Sivaraman; Gianni Antichi; Srinivas Narayana

TL;DR

This work tackles the bottleneck of scaling stateful, high-speed packet processing beyond a single CPU core by introducing State-Compute Replication (SCR). SCR replicates both state and computation across cores and uses a packet history sequencer to piggyback a bounded history on packets, enabling correct, lock-free processing without cross-core synchronization. Empirical evaluation across real and synthetic traces demonstrates linear, deterministic throughput scaling with the number of cores for multiple stateful programs, and hardware analyses show practical, low-cost sequencer implementations on NICs and Top-of-Rack switches. The approach offers a path to sustaining high packet-processing rates as NIC speeds continue to rise, and it provides concrete designs and a roadmap for deployment in data-center and wide-area networks.

Abstract

With the slowdown of Moore's law, CPU-oriented packet processing in software will be significantly outpaced by emerging line speeds of network interface cards (NICs). Single-core packet-processing throughput has saturated. We consider the problem of high-speed packet processing with multiple CPU cores. The key challenge is state: memory that multiple packets must read and update. The prevailing method to scale throughput with multiple cores is state sharding, which processes all packets that update the same state (i.e., the same flow) at the same core. However, given the heavy-tailed nature of realistic flow size distributions, this method will be untenable in the near future, since total throughput is severely limited by single-core performance. This paper introduces state-compute replication, a principle to scale the throughput of a single stateful flow across multiple cores using replication. Our design leverages a packet history sequencer running on a NIC or top-of-the-rack switch to enable multiple cores to update state without explicit synchronization. Our experiments with realistic data center and wide-area Internet traces show that state-compute replication can scale total packet-processing throughput linearly with cores, deterministically and independent of flow size distributions, across a range of realistic packet-processing programs.
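The replication idea in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes round-robin spraying of packets across k cores and a history of the last k-1 packets' relevant fields, and it uses a hypothetical per-flow packet counter as the stateful program. The names `Sequencer`, `Core`, `update`, `run_scr`, and `run_sequential` are invented for illustration. Each core keeps a private copy of the state and replays the piggybacked history before processing its own packet, so no locks or cross-core messages are needed.

```python
from collections import deque


def update(state, fields):
    """Hypothetical state transition: count packets per flow key."""
    new_state = dict(state)
    new_state[fields] = new_state.get(fields, 0) + 1
    return new_state


class Sequencer:
    """Assumed design: round-robin spraying plus a bounded history buffer."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # Fields of the most recent (num_cores - 1) packets.
        self.history = deque(maxlen=num_cores - 1)
        self.seq = 0

    def emit(self, fields):
        core = self.seq % self.num_cores
        out = (core, list(self.history), fields)
        self.history.append(fields)
        self.seq += 1
        return out


class Core:
    """Holds a private state replica; never talks to other cores."""

    def __init__(self):
        self.state = {}

    def process(self, history, fields):
        # Catch up on packets that were sent to the other cores.
        for missed in history:
            self.state = update(self.state, missed)
        self.state = update(self.state, fields)
        return self.state


def run_scr(packets, num_cores):
    """Return the state seen by the processing core after each packet."""
    sequencer = Sequencer(num_cores)
    cores = [Core() for _ in range(num_cores)]
    results = []
    for fields in packets:
        core_id, history, current = sequencer.emit(fields)
        results.append(cores[core_id].process(history, current))
    return results


def run_sequential(packets):
    """Reference: a single core processing every packet in order."""
    state, results = {}, []
    for fields in packets:
        state = update(state, fields)
        results.append(state)
    return results
```

Under these assumptions, the state a core computes for a packet matches what a single core would compute after processing the same packets in order, even when every packet belongs to one flow. The sketch ignores packet loss, which the paper treats separately.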

Paper Structure (21 sections, 2 theorems, 1 equation, 12 figures, 4 tables, 1 algorithm)

Introduction
Background and Motivation
High Speed Packet Processing
Parallelizing Stateful Packet Processing
Goals
State-Compute Replication (SCR)
Scaling Principles
Operationalizing SCR
Packet History Sequencer
Packet format
Hardware data structures for packet history
Packet Loss and Nondeterminism
Evaluation
Experiment Setup
Multi-Core Throughput Scaling
...and 6 more sections

Key Result

Theorem 1

For any regular packet $p_i$, if every core has received $p_j$ ($j \ge i$), every core will finish processing $p_i$.

Figures (12)

Figure 1: Scaling the throughput of a TCP connection state tracker for a single TCP connection across multiple cores. Sharing state across cores degrades performance beyond 2 cores due to contention. Sharding state (using RSS and RSS++ \cite{rss++-conext19}) cannot improve throughput beyond a single CPU core (§\ref{sec:motivation-background}). In contrast, State-Compute Replication (§\ref{sec:design}) provides linear scale-up in throughput with cores.
Figure 2: The nature of CPU work in high-speed packet processing: Consider the throughput of a simple packet forwarding application (packets/second (a), bits/second (b)) running on a single CPU core clocked at 3.6 GHz, as the size of the incoming packets varies. The average latency to execute the XDP program is also shown in nanoseconds (c). CPU usage is tied to the number of packets (not bits) processed per second. Further, significant time elapses in dispatch: CPU work to present the input packet to and retrieve the output packet from the program computation.
Figure 3: An example illustrating the scaling principles. $p_i$ is the $i^{th}$ packet received by the sequencer, $f(p_j)$ are relevant fields from $p_j$, and $S_i$ is the state after processing packets $p_1, ..., p_i$ in order.
Figure 4: Hardware data structures. (a) Packets modified to propagate history from the sequencer to CPU cores. The sequencer prefixes the packet history to the original packet, which allows for a simpler implementation in hardware (§\ref{sec:pipeline-sequencer}) and simpler transformations to make a packet-processing program SCR-aware (App. \ref{app:scr-programming}). In instantiations where the sequencer is partly implemented on a top-of-the-rack switch (§\ref{sec:operationalizing-scr}), we further prefix a dummy Ethernet header to ensure that the NIC can process the packet correctly. (b) The data structure used to maintain and propagate packet history on the Tofino programmable switch pipeline (§\ref{sec:data-structure-packet-history}). Inset shows the specific actions performed on each Tofino register. (c) The data structure used to maintain and propagate packet history on our Verilog module integrated into NetFPGA-PLUS (§\ref{sec:data-structure-packet-history}).
Figure 5: Flow size distributions of the packet traces we used. We used real packet traces captured at (a) a university data center \cite{microsoft-network-sigcomm10} and (b) a wide-area Internet backbone by CAIDA \cite{caida}. We also synthesized (c) a packet trace with real TCP flows whose sizes are drawn from Microsoft's data center flow size distribution \cite{dctcp-sigcomm10}.
...and 7 more figures

Theorems & Definitions (4)

Theorem 1
proof
Lemma 1
proof
