UDON: A case for offloading to general purpose compute on CXL memory

Jon Hermes; Josh Minor; Minjun Wu; Adarsh Patil; Eric Van Hensbergen

TL;DR

The paper argues for offloading compute to general-purpose Arm cores embedded in CXL memory devices to improve memory-disaggregated datacenters. Using Arm AArch64 dual-socket NUMA systems to emulate CXL type-2 devices, it evaluates ML inference and vector search (FAISS/HNSW) workloads, highlighting the benefits of data proximity. For ML inference, the approach can place up to 90% of data in remote memory with only a 20% latency degradation, while offloaded vector-search kernels achieve up to a 6.87× performance improvement with under 10% offload overhead. Across workloads, the results indicate that near-memory compute on CXL devices is viable and offers substantial performance gains, underscoring the need for better compiler and runtime support to automate offloads in future datacenter architectures.
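The NUMA-based emulation described above can be illustrated with a small sketch. It builds `numactl` command lines that either keep compute on the host socket while binding memory to the remote socket (a host reaching into far memory), or move compute to the remote socket as well (an offload to cores next to the data). The node numbers, the helper name `numactl_command`, and the `./infer` binary are assumptions made for illustration; the summary does not state the authors' exact invocation.

```python
def numactl_command(app_argv, compute_node=0, memory_node=1, policy="membind"):
    """Build a numactl invocation (hypothetical helper, not from the paper).

    compute_node: NUMA node whose cores run the workload.
    memory_node:  NUMA node standing in for the CXL memory device.
    policy:       which numactl memory policy flag to use.
    """
    if policy == "membind":
        # Allocate only from memory_node.
        mem_flag = f"--membind={memory_node}"
    elif policy == "preferred":
        # Prefer memory_node, fall back to other nodes if it is full.
        mem_flag = f"--preferred={memory_node}"
    elif policy == "interleave":
        # Spread pages round-robin across both nodes.
        mem_flag = f"--interleave={compute_node},{memory_node}"
    else:
        raise ValueError(f"unknown policy: {policy}")
    return ["numactl", f"--cpunodebind={compute_node}", mem_flag, *app_argv]


# Host cores on node 0 reading data that lives on the emulated device (node 1).
remote_access = numactl_command(["./infer"], compute_node=0, memory_node=1)

# Offload: the same binary runs on the cores that sit next to the data.
offloaded = numactl_command(["./infer"], compute_node=1, memory_node=1)

print(" ".join(remote_access))
print(" ".join(offloaded))
```

Comparing the runtime of the two command lines on a dual-socket machine gives a rough picture of the cost of remote access versus near-memory execution, under the assumption that the cross-socket link is a reasonable stand-in for a CXL link.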

Abstract

Upcoming CXL-based disaggregated memory devices feature special-purpose units to offload compute to near-memory. In this paper, we explore opportunities for offloading compute to general-purpose cores on CXL memory devices, thereby enabling greater utility and diversity of offload. We study two classes of popular memory-intensive applications, ML inference and vector databases, as candidates for computational offload. The study uses Arm AArch64-based dual-socket NUMA systems to emulate CXL type-2 devices. Our study shows promising results. With our ML inference model partitioning strategy for compute offload, we can place up to 90% of data in remote memory with just a 20% performance trade-off. Offloading Hierarchical Navigable Small World (HNSW) kernels in vector databases can provide up to a $6.87\times$ performance improvement with under 10% offload overhead.
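To give a sense of what an HNSW kernel does, here is a minimal sketch of the greedy descent step on a single graph layer. It is not the authors' offloaded kernel or the FAISS implementation; the function name `greedy_search` and the toy graph are assumptions chosen for illustration. Each step reads a neighbor list and the neighbors' vectors, so the loop is dominated by dependent memory accesses, which is the kind of access pattern where running close to the data is expected to help.

```python
import math


def greedy_search(vectors, neighbors, query, entry):
    """Greedy descent on one graph layer (illustrative sketch).

    vectors:   dict mapping node id -> coordinate tuple
    neighbors: dict mapping node id -> list of adjacent node ids
    query:     coordinate tuple to search for
    entry:     node id where the walk starts
    Returns (closest node id found, its distance to the query).
    """
    current = entry
    best = math.dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        # Scan the adjacency list of the node we stood on at loop entry.
        for candidate in neighbors[current]:
            d = math.dist(vectors[candidate], query)
            if d < best:
                best, current, improved = d, candidate, True
    return current, best


# Toy layer: four points on a line, linked as a chain 0-1-2-3.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

node, dist = greedy_search(vectors, neighbors, query=(2.9, 0.0), entry=0)
print(node, round(dist, 3))
```

A full HNSW search repeats this descent across several layers and then widens it into a bounded best-first search on the bottom layer; the sketch keeps only the single-layer step to stay short.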

Paper Structure (11 sections, 4 figures, 3 tables)

Introduction
Background and Motivation
Machine learning inference
Vector databases
Evaluation Methodology
Evaluation Results
Machine learning characterization
PyTorch framework
TFLite framework
Vector database characterization
Conclusion

Figures (4)

Figure 1: PyTorch inference latency slowdown histogram.
Figure 2: Platform B TFLite inference latency slowdown histogram comparing NUMA memory policies and runtimes.
Figure 3: Memory offload versus latency slowdown on Platform B.
Figure 4: Memory sensitivity of indexing (dataset: sift1M).
