ICGMM: CXL-enabled Memory Expansion with Intelligent Caching Using Gaussian Mixture Model
Hanqiu Chen, Yitu Wang, Luis Vitorio Cargnini, Mohammadreza Soltaniyeh, Dongyang Li, Gongjin Sun, Pradeep Subedi, Andrew Chang, Yiran Chen, Cong Hao
TL;DR
This paper tackles the memory wall in CXL-enabled memory expansion by introducing ICGMM, a hardware-managed DRAM caching system that uses a two-dimensional Gaussian Mixture Model (GMM)-based policy to predict 4KB page accesses. The GMM policy engine runs in hardware on an FPGA, guiding caching and eviction to improve cache hit rates and reduce SSD latency, all within a dataflow architecture that minimizes overhead. Results on seven benchmarks show cache miss rate reductions of 0.32% to 6.14% and average SSD access latency reductions of 16.23% to 39.14% relative to LRU, while GMM inference is orders of magnitude faster than LSTM-based policies and uses far fewer hardware resources. The work demonstrates that hardware-friendly probabilistic caching can substantially improve memory expansion performance, making CXL-based SSD-backed systems more practical for memory-intensive workloads.
Abstract
Compute Express Link (CXL) has emerged as a solution to the wide gap between computational speed and data communication rates among a host and multiple devices. It provides a unified, coherent memory space between the host and CXL storage devices such as solid-state drives (SSDs) for memory expansion, with a corresponding DRAM serving as the device cache. However, this introduces challenges such as substantial cache miss penalties, sub-optimal caching due to the data access granularity mismatch between the DRAM "cache" and the SSD "memory", and inefficient hardware cache management. To address these issues, we propose ICGMM, which optimizes caching and eviction directly in hardware using a Gaussian Mixture Model (GMM)-based approach. We prototype our solution on an FPGA board and demonstrate a notable improvement over the classic Least Recently Used (LRU) cache strategy: the cache miss rate decreases by 0.32% to 6.14%, leading to a 16.23% to 39.14% reduction in average SSD access latency. Furthermore, compared to state-of-the-art Long Short-Term Memory (LSTM)-based cache policies, our GMM algorithm on FPGA reduces latency by over 10,000 times while requiring far fewer hardware resources.
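To make the idea of GMM-guided eviction concrete, here is a minimal software sketch; it is not the FPGA policy engine described in the paper. It assumes the two GMM dimensions are a 4 KB page index and a timestamp, which this section does not specify. All function names and the mixture parameters are hypothetical and chosen only for illustration.

```python
import math

PAGE_SIZE = 4096  # 4 KB pages, matching the page granularity named in the text


def page_of(byte_addr):
    """Map a byte address to its 4 KB page index."""
    return byte_addr // PAGE_SIZE


def gaussian2d_pdf(x, mean, cov):
    """Density of a 2-D Gaussian with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form dx^T cov^{-1} dx, using the closed-form 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def gmm_score(x, components):
    """Weighted sum of component densities; components are (weight, mean, cov)."""
    return sum(w * gaussian2d_pdf(x, m, c) for w, m, c in components)


def pick_victim(cached_pages, now, components):
    """Assumed policy: evict the cached page the mixture scores as least likely."""
    return min(cached_pages, key=lambda p: gmm_score((p, now), components))


# Hypothetical single-component mixture centred on page 10 at time 0.
components = [(1.0, (10.0, 0.0), ((4.0, 0.0), (0.0, 4.0)))]
victim = pick_victim([9, 10, 50], now=0.0, components=components)
```

In this toy setup page 50 lies far from the mixture's mass, so it is chosen for eviction, whereas an LRU policy would decide on recency alone. A real deployment would fit the mixture to access traces, and the paper's hardware design may score and evict pages differently.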
