State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing
Qiongwen Xu, Sebastiano Miano, Xiangyu Gao, Tao Wang, Adithya Murugadass, Songyuan Zhang, Anirudh Sivaraman, Gianni Antichi, Srinivas Narayana
TL;DR
This work tackles the bottleneck of scaling stateful, high-speed packet processing beyond a single CPU core by introducing State-Compute Replication (SCR). SCR replicates both state and computation across cores and uses a packet history sequencer to piggyback a bounded history on packets, enabling correct, lock-free processing without cross-core synchronization. Empirical evaluation across real and synthetic traces demonstrates linear, deterministic throughput scaling with the number of cores for multiple stateful programs, and hardware analyses show practical, low-cost sequencer implementations on NICs and Top-of-Rack switches. The approach offers a path to sustaining high packet-processing rates as NIC speeds continue to rise, and it provides concrete designs and a roadmap for deployment in data-center and wide-area networks.
Abstract
With the slowdown of Moore's law, CPU-oriented packet processing in software will be significantly outpaced by emerging line speeds of network interface cards (NICs). Single-core packet-processing throughput has saturated. We consider the problem of high-speed packet processing with multiple CPU cores. The key challenge is state: memory that multiple packets must read and update. The prevailing method to scale throughput with multiple cores involves state sharding, processing all packets that update the same state, i.e., flow, at the same core. However, given the heavy-tailed nature of realistic flow size distributions, this method will be untenable in the near future, since total throughput is severely limited by single-core performance. This paper introduces state-compute replication, a principle to scale the throughput of a single stateful flow across multiple cores using replication. Our design leverages a packet history sequencer running on a NIC or top-of-the-rack switch to enable multiple cores to update state without explicit synchronization. Our experiments with realistic data center and wide-area Internet traces show that state-compute replication can scale total packet-processing throughput linearly with cores, deterministically and independent of flow size distributions, across a range of realistic packet-processing programs.
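
The replication idea described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `Sequencer`, `Core`, and `run` are hypothetical, the state is assumed to be a per-flow byte counter, packets are assumed to be sprayed round-robin, the piggybacked history is assumed to hold the previous `n_cores - 1` packets, and packet loss between sequencer and cores is not modeled. Each core keeps a private replica of the state and replays the history it missed before handling its own packet, so no locks or cross-core messages are needed.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    seq: int   # global sequence number assigned by the (hypothetical) sequencer
    flow: str  # flow identifier
    size: int  # bytes; the only field this toy state update needs


class Sequencer:
    """Hypothetical sequencer: sprays packets round-robin across cores and
    piggybacks the previous (n_cores - 1) packets on each one."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.history = deque(maxlen=n_cores - 1)  # bounded history
        self.next_seq = 0

    def dispatch(self, flow, size):
        pkt = Packet(self.next_seq, flow, size)
        core_id = pkt.seq % self.n_cores          # round-robin, ignores flow
        piggyback = tuple(self.history)           # snapshot before adding pkt
        self.history.append(pkt)
        self.next_seq += 1
        return core_id, pkt, piggyback


class Core:
    """One core with a private state replica (assumed: per-flow byte counts)."""

    def __init__(self):
        self.state = {}
        self.last_seq = -1  # highest sequence number already applied

    def _update(self, pkt):
        self.state[pkt.flow] = self.state.get(pkt.flow, 0) + pkt.size
        self.last_seq = pkt.seq

    def process(self, pkt, piggyback):
        # Replay packets that other cores handled since this core's last turn.
        for old in piggyback:
            if old.seq > self.last_seq:
                self._update(old)
        self._update(pkt)
        return self.state[pkt.flow]  # value a sequential run would report


def run(n_cores, trace):
    """Process a list of (flow, size) pairs; return the per-packet results."""
    sequencer = Sequencer(n_cores)
    cores = [Core() for _ in range(n_cores)]
    results = []
    for flow, size in trace:
        core_id, pkt, piggyback = sequencer.dispatch(flow, size)
        results.append(cores[core_id].process(pkt, piggyback))
    return results
```

In this sketch, calling `run(4, trace)` and `run(1, trace)` on the same trace yields identical per-packet results, even when almost all packets belong to one flow. That is the property sharding cannot provide, since sharding pins a heavy flow to a single core. The cost is that every core repeats the state update for every packet, so the approach pays off when the state update is cheap relative to the rest of per-packet processing; that trade-off is an observation about this toy model rather than a result quoted from the paper.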
