Disaggregated Memory with SmartNIC Offloading: a Case Study on Graph Processing

Jacob Wahlgren, Gabin Schieffer, Maya Gokhale, Roger Pearce, Ivy Peng

TL;DR

This work provides a general architecture design that enables network-attached memory and task offloading onto an off-path programmable SmartNIC, along with a prototype implementation called SODA on the Nvidia BlueField DPU.

Abstract

Disaggregated memory breaks the boundary of monolithic servers to enable memory provisioning on demand. Using network-attached memory to expand memory for memory-intensive applications on compute nodes can improve overall memory utilization across a cluster and reduce the total cost of ownership. However, current software solutions for leveraging network-attached memory must consume resources on the compute node for memory management tasks. Emerging off-path SmartNICs provide general-purpose programmability on low-cost, low-power cores. This work provides a general architecture design that enables network-attached memory and task offloading onto an off-path programmable SmartNIC. We provide a prototype implementation called SODA on the Nvidia BlueField DPU. SODA adapts communication paths and data transfer alternatives, pipelines data movement stages, and enables customizable data caching and prefetching optimizations. We evaluate SODA on five representative graph applications using real-world graphs. Our results show that SODA can achieve up to 7.9x speedup compared to node-local SSD and reduce network traffic by 42% compared to disaggregated memory without SmartNIC offloading, at similar or better performance.

Paper Structure (20 sections, 3 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: A cluster composed of compute nodes and memory nodes. Each compute node consists of the host CPU and an off-path SmartNIC with DPU. Each memory node is equipped with massive memory resources.
  • Figure 2: SODA consists of agents on the host, SmartNIC SoC (DPU), and memory server. Memory objects in the application process's virtual space can be backed by network-attached memory. SODA agents transparently handle tasks for memory management and data movement for the application.
  • Figure 3: Performance variation when using different NUMA nodes in the host memory at message size 64 KB.
  • Figure 4: Performance of different intra-node communication options (using the fastest NUMA configuration).
  • Figure 5: Comparison of performance between intra-node and inter-node communication on the testbed.
  • ...and 6 more figures