Plug & Offload: Transparently Offloading TCP Stack onto Off-path SmartNIC with PnO-TCP
Hailong Nan, Zhe Zhou, Min Yang
TL;DR
This work presents Plug & Offload (PnO), a transparent approach to offloading the entire TCP stack onto off-path SmartNIC DPUs using PnO-TCP, a lightweight user-space stack that runs across a host–DPU pair. By introducing the PnO-Shim for automatic API redirection and a two-component PnO-TCP (a host proxy and a NIC bridge connected by a zero-copy, ring-based communication model), the approach achieves substantial host CPU savings and notable throughput gains for small packets, demonstrated with Redis, Lighttpd, HAProxy, and Echo workloads on a BlueField-3 DPU. Key contributions include full TCP offload without application changes, a detailed host–DPU communication architecture (S-type/G-type rings), and an extensive evaluation of performance and CPU utilization under real-world traffic. The results indicate strong potential for scalable data-center networking, with remaining challenges in PCIe DMA latency, DPU memory bandwidth, and jitter, which the authors discuss and identify as targets for future hardware/software co-design.
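To make the "zero-copy, ring-based communication model" more concrete, the following is a minimal sketch of a generic single-producer/single-consumer descriptor ring. It is not PnO-TCP's actual S-type/G-type ring layout, which this summary does not specify. The names `Descriptor`, `SpscRing`, `pool`, `BUF_SIZE`, and `payload_view` are hypothetical and chosen only for illustration; a real host–DPU ring would additionally need memory barriers and DMA-visible memory. The idea shown is that only small descriptors travel through the ring, while payload bytes stay in a shared buffer pool and are read in place.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical descriptor: names a slot in a shared buffer pool."""
    buf_id: int   # index of the payload buffer in the pool
    length: int   # number of valid payload bytes


class SpscRing:
    """Single-producer/single-consumer ring of descriptors (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self._slots = [None] * capacity
        self._mask = capacity - 1
        self._head = 0  # next slot to write; advanced only by the producer
        self._tail = 0  # next slot to read; advanced only by the consumer

    def enqueue(self, desc: Descriptor) -> bool:
        if self._head - self._tail == len(self._slots):
            return False  # ring full; the caller must retry or back off
        self._slots[self._head & self._mask] = desc
        self._head += 1
        return True

    def dequeue(self) -> Optional[Descriptor]:
        if self._head == self._tail:
            return None  # ring empty
        desc = self._slots[self._tail & self._mask]
        self._tail += 1
        return desc


# Assumed shared payload pool: 4 buffers of 2 KiB each, standing in for
# memory that both sides of the ring can access.
BUF_SIZE = 2048
pool = bytearray(4 * BUF_SIZE)


def payload_view(desc: Descriptor) -> memoryview:
    """Return a view of the payload in the pool without copying it."""
    start = desc.buf_id * BUF_SIZE
    return memoryview(pool)[start:start + desc.length]
```

In this sketch the producer writes a payload into a pool buffer and enqueues a descriptor for it; the consumer dequeues the descriptor and reads the payload through `payload_view`, so the ring itself never carries payload bytes.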
Abstract
Host CPU resources are heavily consumed by TCP stack processing, limiting scalability in data centers. Existing offload methods typically address only partial functionality or lack flexibility. This paper introduces PnO (Plug & Offload), an approach to fully offload TCP processing transparently onto off-path SmartNICs (NVIDIA BlueField DPUs). Key to our solution is PnO-TCP, a novel TCP stack specifically designed for efficient execution on the DPU's general-purpose cores, spanning both the host and the SmartNIC to facilitate the offload. PnO-TCP leverages a lightweight, user-space stack based on DPDK, achieving high performance despite the relatively modest computational power of off-path SmartNIC cores. Our evaluation, using real-world applications (Redis, Lighttpd, and HAProxy), demonstrates that PnO achieves transparent TCP stack offloading, leading to both substantial reductions in host CPU usage and, in many cases, significant performance improvements, particularly for small packet scenarios (< 2 KB) where RPS gains of 34%-127% were observed in single-threaded tests.
