Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders
Hyungkyu Ham, Jeongmin Hong, Geonwoo Park, Yunseon Shin, Okkyun Woo, Wonhyuk Yang, Jinhoon Bae, Eunhyeok Park, Hyojin Sung, Euicheol Lim, Gwangsun Kim
TL;DR
This work targets the slowdown of memory-bound workloads on systems with CXL memory by introducing memory-mapped near-data processing, M$^2$NDP. It combines a low-overhead host-communication mechanism, M$^2$func, with a lightweight, highly concurrent execution model, M$^2\mu$thread, both implemented inside the CXL memory controller. Compared to baseline hosts with passive CXL memory, the approach achieves speedups of up to 128x (14.5x overall) and energy reductions of up to 87.9% (80.3% overall), with improvements for OLAP, KVStore, LLM, DLRM, and graph analytics workloads, while remaining compatible with unmodified CXL.mem. The results also show near-linear scaling across multiple memories and promising cost efficiency, supporting applicability to memory-bound, data-centric workloads.
Abstract
The emerging Compute Express Link (CXL) enables cost-efficient memory expansion beyond the local DRAM of processors. Although its CXL.mem protocol adds minimal latency overhead through an optimized protocol stack, frequent CXL memory accesses can still cause significant slowdowns for memory-bound applications, whether they are latency-sensitive or bandwidth-intensive. Near-data processing (NDP) in the CXL controller promises to overcome these limitations of passive CXL memory. However, prior work on NDP in CXL memory proposes application-specific units, which are unsuitable for practical CXL memory-based systems that must support a variety of applications. Existing CPU or GPU cores, on the other hand, are not cost-effective for NDP because they are not optimized for memory-bound applications. In addition, communication between the host processor and the CXL controller for NDP offloading must have low latency, but existing CXL.io/PCIe-based mechanisms incur $\mu$s-scale latency and are unsuitable for fine-grained NDP. To achieve high-performance NDP end-to-end, we propose a low-overhead, general-purpose NDP architecture for CXL memory called Memory-Mapped NDP (M$^2$NDP), which comprises memory-mapped functions (M$^2$func) and memory-mapped $\mu$threading (M$^2\mu$thread). M$^2$func is a CXL.mem-compatible, low-overhead communication mechanism between the host processor and the NDP controller in CXL memory. M$^2\mu$thread enables a low-cost, general-purpose NDP unit design by introducing lightweight $\mu$threads that support highly concurrent kernel execution with minimal resource waste. Combining the two, M$^2$NDP achieves speedups of up to 128x (14.5x overall) across various workloads and reduces energy by up to 87.9% (80.3% overall) compared to baseline CPU/GPU hosts with passive CXL memory.
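
To make the idea of a memory-mapped function concrete, the following is a minimal toy sketch in Python, not the paper's design. It assumes, purely for illustration, that a device reserves an address range and treats ordinary stores and loads to that range as function calls and result fetches. The names ToyExpander, FUNC_BASE, FUNC_STRIDE, register, and sum_range are all hypothetical, and the abstract does not specify how M$^2$func actually encodes calls.

```python
# Toy model of a memory expander that interprets accesses to a reserved
# address range as function calls. All names and the encoding are assumptions
# made for illustration; they are not taken from the paper.

FUNC_BASE = 0x1000_0000   # hypothetical start of the reserved "function" range
FUNC_STRIDE = 64          # hypothetical spacing: one cache-line slot per function


class ToyExpander:
    def __init__(self, size_words):
        self.mem = [0] * size_words   # ordinary data memory
        self.funcs = {}               # reserved address -> kernel
        self.results = {}             # reserved address -> last return value

    def register(self, slot, kernel):
        """Bind a kernel to a reserved address and return that address."""
        addr = FUNC_BASE + slot * FUNC_STRIDE
        self.funcs[addr] = kernel
        return addr

    def write(self, addr, payload):
        """A store: either an ordinary write or a function invocation."""
        kernel = self.funcs.get(addr)
        if kernel is None:
            self.mem[addr] = payload                    # normal memory write
        else:
            # Payload carries the arguments; the kernel runs next to the data.
            self.results[addr] = kernel(self.mem, *payload)

    def read(self, addr):
        """A load: either an ordinary read or a fetch of a kernel's result."""
        if addr in self.funcs:
            return self.results.get(addr)
        return self.mem[addr]


def sum_range(mem, start, count):
    """Example kernel: reduce a range of words without moving them to the host."""
    return sum(mem[start:start + count])


if __name__ == "__main__":
    dev = ToyExpander(16)
    for i in range(8):
        dev.write(i, i + 1)               # fill words 0..7 with 1..8
    sum_addr = dev.register(0, sum_range)
    dev.write(sum_addr, (0, 8))           # "call" the kernel with a plain store
    print(dev.read(sum_addr))             # fetch the result with a plain load
```

In this sketch the host needs only loads and stores to invoke a kernel and read its result, which is the property that lets such a scheme avoid a separate, higher-latency control path. A real controller would also have to handle concurrency, argument encoding, and completion signaling, none of which this model attempts.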
