NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering
Zhe Zhou, Yiqi Chen, Tao Zhang, Yang Wang, Ran Shu, Shuotao Xu, Peng Cheng, Lei Qu, Yongqiang Xiong, Jie Zhang, Guangyu Sun
TL;DR
NeoMem introduces a hardware–software co-design for CXL-native memory tiering by placing a device-side profiler, NeoProf, in CXL memory controllers to deliver high-resolution page hotness and runtime state information to the OS. Guided by a dynamic, histogram- and bandwidth-aware migration policy, NeoMem enables timely hot-page promotion with minimal CPU overhead. Evaluated on real FPGA-based CXL hardware and Linux kernel v6.3, NeoMem achieves 32%–67% geomean speedup over baselines and substantially reduces slow-tier traffic. The work demonstrates the practicality of hardware-accelerated memory profiling for heterogeneous memory systems and outlines pathways for virtualization and multi-device scalability.
Abstract
The Compute Express Link (CXL) interconnect makes it feasible to integrate diverse types of memory into servers via its byte-addressable SerDes links. Because these memory types differ in access latency, harnessing the full potential of CXL-based heterogeneous memory systems requires efficient memory tiering. However, prior work has made little fundamental progress because existing memory access profiling techniques offer low resolution and incur high overhead. To address this challenge, we propose NeoMem, a memory tiering solution built on hardware/software co-design. NeoMem offloads memory profiling to CXL device-side controllers by integrating a dedicated hardware unit called NeoProf. NeoProf monitors memory accesses and provides the OS with page hotness statistics and other useful system state information. On the OS kernel side, we design a revamped memory-tiering strategy that enables accurate and timely hot page promotion based on NeoProf statistics. We implement NeoMem on a real FPGA-based CXL memory platform and Linux kernel v6.3. Comprehensive evaluations demonstrate that NeoMem achieves 32% to 67% geomean speedup over several existing memory tiering solutions.
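To make the idea of promotion driven by device-side hotness statistics concrete, the following is a minimal Python sketch, not NeoMem's actual policy. It assumes the profiler exposes per-page access counts as a dictionary, buckets them into a coarse log2 histogram, and picks the lowest hotness bin whose pages still fit within a promotion budget. The function names, the bin layout, and the `budget_pages` parameter are all hypothetical, and the bandwidth-aware part of the policy is omitted.

```python
def hotness_bin(count, num_bins=8):
    """Map an access count to a log2-spaced bin (hypothetical layout)."""
    return min(count.bit_length(), num_bins - 1)


def hotness_histogram(page_counts, num_bins=8):
    """Count how many pages fall into each hotness bin."""
    hist = [0] * num_bins
    for count in page_counts.values():
        hist[hotness_bin(count, num_bins)] += 1
    return hist


def pick_threshold_bin(hist, budget_pages):
    """Walk from the hottest bin downward and return the lowest bin
    whose cumulative page count still fits within the budget.
    Bin 0 (never-accessed pages) is never eligible."""
    total = 0
    for b in range(len(hist) - 1, 0, -1):
        if total + hist[b] > budget_pages:
            return b + 1  # including bin b would exceed the budget
        total += hist[b]
    return 1


def select_hot_pages(page_counts, budget_pages, num_bins=8):
    """Return the sorted page addresses chosen for promotion."""
    hist = hotness_histogram(page_counts, num_bins)
    threshold = pick_threshold_bin(hist, budget_pages)
    return sorted(
        page for page, count in page_counts.items()
        if hotness_bin(count, num_bins) >= threshold
    )
```

Selecting whole bins rather than ranking individual pages keeps the decision cheap: the threshold depends only on a small histogram, at the cost of promoting nothing when the hottest bin alone exceeds the budget.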
