Dissecting CXL Memory Performance at Scale: Analysis, Modeling, and Optimization

Jinshu Liu; Hamid Hadian; Hanchen Xu; Daniel S. Berger; Huaicheng Li

TL;DR

This work tackles the challenge of understanding and optimizing memory performance in CXL-enabled systems, where long and variable tail latencies make performance hard to predict. The authors introduce SupMario, a framework that combines large-scale workload characterization with lightweight, counter-based performance models to predict workload slowdown under CXL memory. They attribute slowdowns to CPU-backend stalls and develop dedicated models for DRAM, cache, and store interactions, achieving high predictive accuracy across multiple devices and platforms. Building on these insights, they propose two practical policies: best-shot interleaving, which targets bandwidth-bound workloads, and Alto, a tiering policy for latency-sensitive workloads. The results demonstrate strong observability, predictability, and optimization potential for future CXL-rich memory hierarchies, with broad implications for memory management and CPU design.
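To make the bandwidth-bound interleaving idea concrete, here is a minimal first-order sketch. It is not the paper's best-shot policy, which is guided by its prediction model. It only assumes that a purely bandwidth-bound stream reads from local DRAM and CXL in parallel, so the tier that finishes last sets the runtime. The function names and the bandwidth figures (200 GB/s and 50 GB/s) are hypothetical and chosen for illustration.

```python
def bandwidth_proportional_split(dram_gbps: float, cxl_gbps: float) -> float:
    """Fraction of traffic to send to CXL so both tiers finish at the same time.

    Assumption: the workload is purely bandwidth-bound and latency is ignored.
    """
    if dram_gbps <= 0 or cxl_gbps <= 0:
        raise ValueError("bandwidths must be positive")
    return cxl_gbps / (dram_gbps + cxl_gbps)


def effective_bandwidth(cxl_fraction: float, dram_gbps: float, cxl_gbps: float) -> float:
    """Throughput (GB/s) when `cxl_fraction` of the traffic goes to CXL.

    Both tiers transfer in parallel, so the slower one to finish sets the time.
    """
    if not 0.0 <= cxl_fraction <= 1.0:
        raise ValueError("cxl_fraction must be in [0, 1]")
    time_per_gb = max(cxl_fraction / cxl_gbps, (1.0 - cxl_fraction) / dram_gbps)
    return 1.0 / time_per_gb


# Hypothetical bandwidths, not measurements from the paper.
DRAM_GBPS, CXL_GBPS = 200.0, 50.0
best = bandwidth_proportional_split(DRAM_GBPS, CXL_GBPS)
for fraction in (0.0, best, 0.5):
    bw = effective_bandwidth(fraction, DRAM_GBPS, CXL_GBPS)
    print(f"CXL share {fraction:.2f}: {bw:.1f} GB/s")
```

Under these assumptions the aggregate bandwidth peaks when the CXL share equals CXL's share of total bandwidth. Sending half of the pages to the slower tier is worse than using DRAM alone, which is why a fixed 1:1 interleave can hurt.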

Abstract

We present SupMario, a characterization framework designed to thoroughly analyze, model, and optimize CXL memory performance. SupMario is based on extensive evaluation of 265 workloads spanning 4 real CXL devices within 7 memory latency configurations across 4 processor platforms. SupMario uncovers many key insights, including detailed workload performance at sub-µs memory latencies (140-410 ns), CXL tail latencies, CPU tolerance to CXL latencies, CXL performance root-cause analysis, and precise performance prediction models. In particular, SupMario performance models rely solely on 12 CPU performance counters and, with a 10% misprediction target, accurately fit over 99% of workloads for NUMA memory and 91%-94% of workloads for CXL memory. We demonstrate the practical utility of SupMario characterization findings, models, and insights by applying them to popular CXL memory management schemes, such as page interleaving and tiering policies, to identify system inefficiencies at runtime. We introduce a novel "best-shot" page interleaving policy and a regulated page tiering policy (Alto), tailored for memory bandwidth-sensitive and latency-sensitive workloads, respectively. In bandwidth-bound scenarios, our "best-shot" interleaving, guided by our novel performance prediction model, achieves close-to-optimal results by exploiting the aggregate system and CXL/NUMA memory bandwidth. For latency-sensitive workloads, Alto, driven by our key insight of utilizing "amortized" memory latency to regulate unnecessary page migrations, achieves up to 177% improvement over state-of-the-art memory tiering systems like TPP, as demonstrated through extensive evaluation with 8 real-world applications.
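The accuracy claim above (a share of workloads fit within a 10% misprediction target) can be read as a simple coverage metric. The sketch below is one plausible reading, not the authors' evaluation code. It assumes slowdown is the relative runtime increase over local memory and that misprediction is the absolute difference between predicted and measured slowdown. The function names and sample values are hypothetical.

```python
from typing import Sequence


def slowdown(local_runtime: float, cxl_runtime: float) -> float:
    """Relative runtime increase when moving from local memory to CXL (assumed definition)."""
    if local_runtime <= 0:
        raise ValueError("local_runtime must be positive")
    return (cxl_runtime - local_runtime) / local_runtime


def fraction_within_target(
    measured: Sequence[float], predicted: Sequence[float], target: float = 0.10
) -> float:
    """Share of workloads whose predicted slowdown is within `target` of the measured one.

    Assumption: misprediction is |predicted - measured|, with slowdowns as fractions.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(abs(p - m) <= target for m, p in zip(measured, predicted))
    return hits / len(measured)


# Hypothetical slowdowns for four workloads, not data from the paper.
measured = [0.05, 0.30, 0.80, 0.12]
predicted = [0.08, 0.27, 0.65, 0.25]
print(fraction_within_target(measured, predicted))
```

If the paper instead defines misprediction relative to the measured slowdown, only the comparison inside `fraction_within_target` would change.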

Paper Structure (27 sections, 5 equations, 19 figures, 4 tables)

Introduction
Background and Motivation
Overview and CXL Characterization
SupMario Overview
Platform
CXL Device Characterization
Workload Characterization
Performance Modeling
Slowdown Root-Cause Analysis
Cache Slowdown ($S_{cache}$)
Workload Slowdown Diversity
CXL Slowdown Prediction
Strawman
Latency and Bandwidth Sensitivity
DRAM (Load) Slowdown Model
...and 12 more sections

Figures (19)

Figure 1: CXL latency and bandwidth heterogeneity.
Figure 2: Overview. Our in-depth and at-scale characterization enables CXL performance modeling and optimization.
Figure 3: CXL Latency CDF. Not all CXL devices are created equal. Unlike local/NUMA memory, CXL shows high tail latencies.
Figure 4: CDFs of workload slowdowns under various CXL devices. (a) CDFs of SPEC workloads on all our platforms; (b) tail latency causes significant slowdown under CXL+NUMA for a latency-insensitive workload; (c) SPR vs. EMR SPEC results under CXL-A and CXL-B; (d) same as (c) but for all 265 workloads.
Figure 5: CXL slowdown breakdown. Figure (a) shows various components where CXL introduces overheads; Figure (b) details the flow of CXL-induced cache slowdowns.
...and 14 more figures
