
Demystifying Datapath Accelerator Enhanced Off-path SmartNIC

Xuzheng Chen, Jie Zhang, Ting Fu, Yifan Shen, Shu Ma, Kun Qian, Lingjun Zhu, Chao Shi, Yin Zhang, Ming Liu, Zeke Wang

TL;DR

This work addresses the need for line-rate, programmable datapath processing on SmartNICs by characterizing NVIDIA BlueField-3’s datapath accelerator (DPA). It conducts an architectural study across the computing, networking, and memory subsystems, comparing DPA to host CPUs and Arm cores, and identifies three architectural characteristics that targeted workloads can exploit. Three case studies (clock synchronization, network function virtualization, and key-value aggregation) demonstrate practical guidelines for exploiting DPA: offload latency-sensitive and easy-to-parallelize tasks, carefully manage working-set sizes to fit DPA caches, and selectively buffer memory across DPA, Arm, and host memories. The findings show substantial performance potential (e.g., up to 4.3$\times$ higher throughput in key-value aggregation, along with tighter latency bounds) and offer concrete vendor and programmer recommendations to harness DPA benefits despite DPA’s weaker per-thread performance. Overall, the paper provides a framework for using a programmable DPA to complement host resources in off-path SmartNICs, guiding future design and software strategies in cloud environments.

Abstract

Network speeds grow quickly in the modern cloud, so SmartNICs are introduced to offload network processing tasks, even application logic. However, typical multicore SmartNICs such as BlueField-2 are only capable of processing control-plane tasks with their embedded processors that have limited memory bandwidth and computing power. On the other hand, cloud applications evolve rapidly, such that a limited number of fixed hardware engines in a SmartNIC cannot satisfy the requirements of cloud applications. Therefore, SmartNIC programmers call for a programmable datapath accelerator (DPA) to process network traffic at line rate. However, no existing work has unveiled the performance characteristics of the existing DPA. To this end, we present the first architectural characterization of the latest DPA-enhanced BlueField-3 (BF3) SmartNIC. Our evaluation results indicate that BF3's DPA is significantly wimpier than the off-path Arm processor and the host CPU. However, we still identify that DPA has three unique architectural characteristics that unleash the performance potential of DPA. Specifically, we demonstrate how to take advantage of DPA's three architectural characteristics regarding computing, networking, and memory subsystems. Then we propose three important guidelines for programmers to fully unleash the potential of DPA. To demonstrate the effectiveness of our approach, we conduct detailed case studies regarding each guideline. Our case study on key-value aggregation achieves up to 4.3$\times$ higher throughput by using our guidelines to optimize memory combinations.

Paper Structure (22 sections, 17 figures, 2 tables)

Figures (17)

  • Figure 1: On-path and Off-path SmartNICs.
  • Figure 2: BlueField-3 SmartNIC architecture.
  • Figure 3: Cache-aware Roofline Model for different general-purpose computing power.
  • Figure 4: DPA accesses three memory types.
  • Figure 5: Cache latency for all computer resources.
  • ...and 12 more figures