CXL Topology-Aware and Expander-Driven Prefetching: Unlocking SSD Performance

Dongsuk Oh, Miryeong Kwon, Jiseon Kim, Eunjee Na, Junseok Moon, Hyunkyu Choi, Seonghyeon Jang, Hanjin Choi, Hongjoo Jung, Sangwon Lee, Myoungsoo Jung

TL;DR

This work targets the latency gap between DRAM and SCM-backed CXL-SSDs by offloading LLC prefetching from the CPU to the CXL expander network, leveraging CXL.mem back-invalidation for coherence. It introduces ExPAND, an architecture with a host-side reflector and an expander-side decider that uses a heterogeneous ML-based address predictor and a timing predictor to generate timely prefetches, while computing end-to-end prefetch timeliness from CXL topology data. The paper demonstrates that ExPAND can substantially outperform baseline prefetchers, achieving up to $9.0\times$ speedups over NoPrefetch for graph workloads and $14.7\times$ for SPEC CPU benchmarks, with strong gains when backend media latency is favorable. The combination of topology-aware timing, bidirectional CXL communication, and ML-assisted addressing enables data to be brought closer to the host LLC more efficiently, reducing reliance on CXL-SSDs and improving real-world performance for memory-disaggregated systems.

Abstract

Integrating Compute Express Link (CXL) with SSDs provides scalable access to large memory, but at lower speeds than DRAM. We present ExPAND, an expander-driven CXL prefetcher that offloads last-level cache (LLC) prefetching from the host CPU to CXL-SSDs. ExPAND uses a heterogeneous prediction algorithm for prefetching and ensures data consistency with CXL.mem's back-invalidation. We examine prefetch timeliness for accurate latency estimation. ExPAND, being aware of CXL multi-tiered switching, provides end-to-end latency for each CXL-SSD and precise prefetch timeliness estimations. Our method reduces CXL-SSD reliance and enables direct host cache access for most data. ExPAND improves the performance of graph applications and SPEC CPU by 9.0$\times$ and 14.7$\times$, respectively, surpassing CXL-SSD pools with diverse prefetching strategies.
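The idea of topology-aware prefetch timeliness can be illustrated with a minimal sketch. This is not the paper's implementation: the `ExpanderPath` type, its fields, the latency model (a round trip over links and switch tiers plus one media read), and all numbers are assumptions chosen only for illustration. The sketch estimates an end-to-end latency for one CXL-SSD and checks whether a prefetch issued now would arrive before the predicted demand access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpanderPath:
    """Hypothetical description of the route from the host to one CXL-SSD."""
    switch_levels: int    # number of CXL switch tiers on the path (assumed known)
    per_switch_ns: float  # assumed one-way latency added by each switch tier
    link_ns: float        # assumed one-way latency of the links themselves
    media_ns: float       # assumed read latency of the backend media


def end_to_end_latency_ns(path: ExpanderPath) -> float:
    """Assumed model: request and response each cross the links and every
    switch tier once, and the device performs one media read in between."""
    one_way = path.link_ns + path.switch_levels * path.per_switch_ns
    return 2 * one_way + path.media_ns


def is_timely(path: ExpanderPath, time_until_demand_ns: float) -> bool:
    """A prefetch is timely here if the data can reach the host before the
    predicted demand access occurs."""
    return end_to_end_latency_ns(path) <= time_until_demand_ns


# Illustrative values only; none of these numbers come from the paper.
example_path = ExpanderPath(switch_levels=2, per_switch_ns=50.0,
                            link_ns=30.0, media_ns=1000.0)
example_latency = end_to_end_latency_ns(example_path)
```

Under this simplified model, a device behind more switch tiers needs its prefetches issued earlier, which is why a per-device end-to-end latency is useful for judging timeliness.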

Paper Structure

This paper contains 17 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Analyzing the impact of locality.
  • Figure 2: CXL-SSD prefetching performance analysis.
  • Figure 3: Overview of ExPAND.
  • Figure 4: Overall performance.
  • Figure 5: Performance comparison with local DRAM.
  • ...and 2 more figures