FlexiNS: A SmartNIC-Centric, Line-Rate and Flexible Network Stack
Xuzheng Chen, Jie Zhang, Baolin Zhu, Xueying Zhu, Zhongqing Chen, Shu Ma, Lingjun Zhu, Chao Shi, Yin Zhang, Zeke Wang
TL;DR
FlexiNS addresses the widening gap between NIC line rates and CPU processing speed with a SmartNIC-offloaded network stack built on four key techniques: a header-only TX offload path, an unlimited-working-set in-cache RX path, a DMA-only notification pipe, and a programmable offloading engine. Implemented on an Nvidia BlueField-3 SmartNIC with RDMA IBV verbs compatibility, it sustains line-rate throughput while keeping transport and congestion-control policies programmable in software. Empirical results show that FlexiNS delivers up to 2.2$\times$ higher IOPS in block storage than a microkernel-based baseline and 1.3$\times$ higher throughput in KVCache transfer than a hardware-offloaded baseline, while keeping host overhead minimal. The design is DOCA-free and broadly compatible with existing RDMA workloads, giving cloud providers a practical path to flexible, high-performance data-plane programmability at scale.
Abstract
As the gap between network and CPU speeds rapidly widens, the CPU-centric network stack proves inadequate due to excessive CPU and memory overhead. While hardware-offloaded network stacks alleviate these issues, they suffer from limited flexibility in both the control and data planes. Offloading the network stack to an off-path SmartNIC seems a promising way to provide high flexibility; however, throughput remains constrained by inherent SmartNIC architectural limitations. To this end, we design FlexiNS, a SmartNIC-centric network stack with software transport programmability and line-rate packet processing capabilities. To address the challenges induced by these SmartNIC limitations, FlexiNS introduces: (a) a header-only offloading TX path; (b) an unlimited-working-set in-cache processing RX path; (c) a high-performance DMA-only notification pipe; and (d) a programmable offloading engine. We prototype FlexiNS on an Nvidia BlueField-3 SmartNIC and provide out-of-the-box RDMA IBV verbs compatibility to users. FlexiNS achieves 2.2$\times$ higher throughput than the microkernel-based baseline in block storage disaggregation and 1.3$\times$ higher throughput than the hardware-offloaded baseline in KVCache transfer.
