UDON: A case for offloading to general purpose compute on CXL memory
Jon Hermes, Josh Minor, Minjun Wu, Adarsh Patil, Eric Van Hensbergen
TL;DR
The paper argues for offloading compute to general-purpose Arm cores embedded in CXL memory devices to improve performance in memory-disaggregated datacenters. Using dual-socket Arm AArch64 NUMA systems to emulate CXL type-2 devices, it evaluates ML inference and vector search workloads with FAISS/HNSW, highlighting the benefits of data proximity. For ML inference, the approach can place up to 90% of the data in remote memory with only 20% latency degradation, while offloaded vector-search kernels reduce latency by up to 6.87$\times$ with under 10% offload overhead. Across both workloads, the results indicate that near-memory compute on CXL devices is viable and offers substantial performance gains, and they underscore the need for better compiler and runtime support to automate offloads in future datacenter architectures.
Abstract
Upcoming CXL-based disaggregated memory devices feature special-purpose units to offload compute to near-memory. In this paper, we explore opportunities for offloading compute to general-purpose cores on CXL memory devices, thereby enabling greater utility and diversity of offload. We study two classes of popular memory-intensive applications, ML inference and vector databases, as candidates for computational offload. The study uses Arm AArch64-based dual-socket NUMA systems to emulate CXL type-2 devices. Our study shows promising results. With our ML inference model partitioning strategy for compute offload, we can place up to 90% of the data in remote memory with just a 20% performance trade-off. Offloading Hierarchical Navigable Small World (HNSW) kernels in vector databases can provide up to 6.87$\times$ performance improvement with under 10% offload overhead.
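
To illustrate why HNSW search is a natural candidate for near-memory offload, the following is a minimal sketch of greedy graph descent, the routine HNSW uses on its upper layers. It is not the paper's kernel or the FAISS implementation; the names `greedy_search`, `vectors`, and `neighbors` and the toy graph are assumptions made for illustration. Each hop reads the vectors of every neighbor of the current node, so the access pattern is a chain of dependent, scattered memory reads with little arithmetic per byte.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_search(vectors, neighbors, entry, query):
    """Greedy descent on a proximity graph (illustrative, single layer).

    Starting at `entry`, repeatedly move to the neighbor closest to
    `query` and stop when no neighbor is closer than the current node.
    Returns (node id, distance to query, number of hops taken).
    """
    current = entry
    current_dist = l2(vectors[current], query)
    hops = 0
    while True:
        best, best_dist = current, current_dist
        # Every neighbor lookup is a dependent read of a stored vector.
        for n in neighbors[current]:
            d = l2(vectors[n], query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:
            return current, current_dist, hops
        current, current_dist = best, best_dist
        hops += 1


# Hypothetical toy data: five 1-D points connected in a chain.
vectors = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

result = greedy_search(vectors, neighbors, entry=0, query=(3.2,))
```

In this sketch only the final node id and distance would need to travel back to the host, while all intermediate vector reads stay local to wherever the loop runs. That asymmetry is the intuition behind running such a kernel next to the memory that holds the index.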
