UDON: A case for offloading to general purpose compute on CXL memory

Jon Hermes, Josh Minor, Minjun Wu, Adarsh Patil, Eric Van Hensbergen

TL;DR

The paper argues for offloading compute to general-purpose Arm cores embedded in CXL memory devices in memory-disaggregated datacenters. Using dual-socket Arm AArch64 NUMA systems to emulate CXL type-2 devices, it evaluates ML inference and vector search workloads (FAISS/HNSW), highlighting the benefit of running compute close to the data. For ML inference, the approach can place up to 90% of the data in remote memory with only a 20% performance trade-off, while offloaded vector-search kernels achieve up to a 6.87× performance improvement with under 10% offload overhead. Across workloads, the results indicate that near-memory compute on CXL devices is viable and can offer substantial performance gains, underscoring the need for better compiler and runtime support to automate offloads in future datacenter architectures.

Abstract

Upcoming CXL-based disaggregated memory devices feature special-purpose units to offload compute to near-memory. In this paper, we explore opportunities for offloading compute to general-purpose cores on CXL memory devices, thereby enabling a greater utility and diversity of offload. We study two classes of popular memory-intensive applications, ML inference and vector databases, as candidates for computational offload. The study uses Arm AArch64-based dual-socket NUMA systems to emulate CXL type-2 devices. Our study shows promising results. With our ML inference model partitioning strategy for compute offload, we can place up to 90% of the data in remote memory with just a 20% performance trade-off. Offloading Hierarchical Navigable Small World (HNSW) kernels in vector databases can provide up to a 6.87× performance improvement with under 10% offload overhead.

Paper Structure (11 sections, 4 figures, 3 tables)

This paper contains 11 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: PyTorch inference latency slowdown histogram.
  • Figure 2: Platform B TFLite inference latency slowdown histogram comparing NUMA memory policies and runtimes.
  • Figure 3: Mem offload to latency slowdown on platform B.
  • Figure 4: Mem sensitivity of indexing (dataset sift1M).