Exploring DRAM Cache Prefetching for Pooled Memory
Chandrahas Tirumalasetty, Narasimha Annapreddy
TL;DR
This work investigates a hardware-based DRAM cache that prefetches sub-page blocks from Fabric Attached Memory (FAM) over a CXL-like interconnect to reduce data-access latency in disaggregated memory systems. It combines a Signature Path Prefetcher (SPP) with a DRAM cache managed by a hardware root complex, and introduces two optimizations to handle bandwidth contention between demand and prefetch traffic: Prefetch Bandwidth Adaptation at the compute node and Weighted Fair Queueing (WFQ) at the memory node. The approach is evaluated in single- and multi-node configurations using SPEC, PARSEC, Splash, and GAP benchmarks, showing IPC gains of around 7% from DRAM cache prefetching, with the optimizations adding a further 7–10%. The results indicate that sub-page DRAM caching, when paired with adaptive throttling and WFQ, can mitigate FAM latency and improve overall system performance in memory-pooled environments.
Abstract
Hardware-based memory pooling, enabled by interconnect standards like CXL, has been gaining popularity among cloud providers and system integrators. While pooling memory resources has cost benefits, it comes at the penalty of increased memory access latency. With yet another addition to the memory hierarchy, local DRAM can potentially be used as a block cache (DRAM cache) for fabric-attached memory (FAM), and data prefetching techniques can be used to hide the FAM access latency. This paper proposes a system for prefetching sub-page blocks from FAM into the DRAM cache to improve data access latency and application performance. We further optimize our DRAM cache prefetch mechanism through enhancements that mitigate the performance degradation caused by bandwidth contention at the FAM. We consider the potential for providing additional functionality at the CXL memory node through weighted fair queuing of demand and prefetch requests. We compare this memory-node-level approach with adapting the prefetch rate at the compute node based on observed latencies. We evaluate the proposed system in single-node and multi-node configurations with applications from the SPEC, PARSEC, Splash, and GAP benchmark suites. Our evaluation suggests that DRAM cache prefetching results in a 7% IPC improvement and that both of the proposed optimizations can further increase IPC by 7-10%.
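To make the memory-node idea concrete, the following is a minimal sketch of weighted fair queuing over two request classes, demand and prefetch. It is not the paper's implementation: it uses a simplified self-clocked variant of WFQ, and the class name `WeightedFairQueue`, the 3:1 weights, and the unit request size are all assumptions chosen for illustration.

```python
import heapq
from itertools import count


class WeightedFairQueue:
    """Toy fair-queuing scheduler (self-clocked variant, illustrative only).

    Each request gets a virtual finish tag; the request with the smallest
    tag is served next. A class with a larger weight accumulates tags more
    slowly, so it receives a proportionally larger share of service.
    """

    def __init__(self, weights):
        # weights: hypothetical mapping such as {"demand": 3, "prefetch": 1}
        self.weights = dict(weights)
        self.virtual_time = 0.0
        self.last_finish = {cls: 0.0 for cls in self.weights}
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def enqueue(self, cls, request, size=1.0):
        # A request starts after the previous one of its class finishes,
        # or at the current virtual time if that class was idle.
        start = max(self.virtual_time, self.last_finish[cls])
        finish = start + size / self.weights[cls]
        self.last_finish[cls] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), cls, request))

    def dequeue(self):
        # Serve the request with the smallest virtual finish tag.
        if not self._heap:
            return None
        finish, _, cls, request = heapq.heappop(self._heap)
        self.virtual_time = finish  # self-clocking: advance to the served tag
        return cls, request


if __name__ == "__main__":
    wfq = WeightedFairQueue({"demand": 3, "prefetch": 1})
    for i in range(8):
        wfq.enqueue("demand", f"d{i}")
    for i in range(8):
        wfq.enqueue("prefetch", f"p{i}")
    served = [wfq.dequeue()[0] for _ in range(8)]
    print(served.count("demand"), served.count("prefetch"))
```

With both classes backlogged and weights of 3:1, the first eight requests served consist of six demand and two prefetch requests, so prefetch traffic keeps making progress without crowding out demand traffic. A real memory-node scheduler would also need to account for request sizes, DRAM bank state, and arrival times, which this sketch ignores.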
