FlexiNS: A SmartNIC-Centric, Line-Rate and Flexible Network Stack

Xuzheng Chen, Jie Zhang, Baolin Zhu, Xueying Zhu, Zhongqing Chen, Shu Ma, Lingjun Zhu, Chao Shi, Yin Zhang, Zeke Wang

TL;DR

FlexiNS addresses the widening gap between NIC line rates and CPU processing speed with a SmartNIC-centric network stack built on four techniques: a header-only TX offload path, an unlimited-working-set in-cache RX path, a DMA-only notification pipe, and a programmable offloading engine. Implemented on Nvidia BlueField-3 with RDMA IBV verbs compatibility, it sustains line-rate throughput while keeping transport and congestion-control policies programmable in software. Empirical results show FlexiNS delivers up to 2.2× higher IOPS in block storage than a microkernel-based baseline and 1.3× higher throughput in KVCache transfer than a hardware-offloaded baseline, while keeping host overhead minimal. The design is DOCA-free and broadly compatible with existing RDMA workloads, offering cloud providers a practical path to flexible, high-performance data-plane programmability at scale.

Abstract

As the gap between network and CPU speeds rapidly increases, the CPU-centric network stack proves inadequate due to excessive CPU and memory overhead. While hardware-offloaded network stacks alleviate these issues, they suffer from limited flexibility in both the control and data planes. Offloading the network stack to an off-path SmartNIC seems a promising way to provide high flexibility; however, throughput remains constrained by inherent SmartNIC architectural limitations. To this end, we design FlexiNS, a SmartNIC-centric network stack with software transport programmability and line-rate packet processing capabilities. To address these SmartNIC-induced challenges, FlexiNS introduces: (a) a header-only offloading TX path; (b) an unlimited-working-set in-cache processing RX path; (c) a high-performance DMA-only notification pipe; and (d) a programmable offloading engine. We prototype FlexiNS on the Nvidia BlueField-3 SmartNIC and provide out-of-the-box RDMA IBV verbs compatibility to users. FlexiNS achieves 2.2× higher throughput than the microkernel-based baseline in block storage disaggregation and 1.3× higher throughput than the hardware-offloaded baseline in KVCache transfer.

Paper Structure

This paper contains 29 sections, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Comparison of different network stack designs.
  • Figure 2: Comparison of the throughput achieved by different stacks and the corresponding host memory bandwidth usage.
  • Figure 3: A memory-intensive application causes interference with the network stack.
  • Figure 4: High-level system architecture and TX/RX data flow of a naïve SmartNIC-centric network stack.
  • Figure 5: Architecture overview of FlexiNS.
  • ...and 13 more figures