Architectural and System Implications of CXL-enabled Tiered Memory

Yujie Yang, Lingfeng Xiang, Peiran Du, Zhen Lin, Weishu Deng, Ren Wang, Andrey Kudryavtsev, Louis Ko, Hui Lu, Jia Rao

TL;DR

This work investigates how CXL-enabled remote memory interacts with conventional DDR memory and the CPU memory hierarchy. By using carefully designed micro-benchmarks, it identifies architectural bottlenecks caused by latency and heterogeneity, notably unfair queuing and limited CXL parallelism that degrade DDR bandwidth and LLC performance. The authors introduce MIKU, a Dynamic Memory Request Control mechanism that prioritizes DDR while serving CXL on a best-effort basis, using service-time estimates to adapt CXL request rates. Evaluations with micro-benchmarks and real workloads show that MIKU can restore DDR throughput near its peak and sustain CXL performance, offering a practical path to efficient tiered memory in next-generation systems.

Abstract

Memory disaggregation is an emerging technology that decouples memory from traditional memory buses, enabling independent scaling of compute and memory. Compute Express Link (CXL), an open-standard interconnect technology, facilitates memory disaggregation by allowing processors to access remote memory through the PCIe bus while preserving the shared-memory programming model. This innovation creates a tiered memory architecture combining local DDR and remote CXL memory with distinct performance characteristics. In this paper, we investigate the architectural implications of CXL memory, focusing on its increased latency and performance heterogeneity, which can undermine the efficiency of existing processor designs optimized for (relatively) uniform memory latency. Using carefully designed micro-benchmarks, we identify bottlenecks such as limited hardware-level parallelism in CXL memory and unfair queuing in memory request handling, and we measure their impact on DDR memory performance and inter-core synchronization. Our findings reveal that the disparity in memory tier parallelism can reduce DDR memory bandwidth by up to 81% under heavy loads. To address these challenges, we propose a Dynamic Memory Request Control mechanism, MIKU, that prioritizes DDR memory requests while serving CXL memory requests on a best-effort basis. By dynamically adjusting CXL request rates based on service time estimates, MIKU achieves near-peak DDR throughput while maintaining high performance for CXL memory. Our evaluation with micro-benchmarks and representative workloads demonstrates the potential of MIKU to enhance tiered memory system efficiency.
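
The idea of adjusting CXL request rates from service-time estimates can be illustrated with a small feedback loop. The sketch below is not the authors' MIKU implementation: the class name CxlRateController, every parameter, and the additive-increase/multiplicative-decrease policy are assumptions chosen only to show the general shape of such a controller. It smooths sampled DDR service times, cuts the admitted share of CXL requests when the estimate rises above an assumed baseline, and restores that share gradually once DDR recovers.

```python
from dataclasses import dataclass, field


@dataclass
class CxlRateController:
    """Hypothetical feedback controller; NOT the authors' MIKU implementation."""

    baseline_ns: float        # assumed DDR service time without CXL traffic
    tolerance: float = 1.2    # assumed slack before DDR counts as delayed
    alpha: float = 0.5        # weight of each new sample in the moving average
    min_rate: float = 0.05    # floor, so CXL is still served best-effort
    max_rate: float = 1.0     # 1.0 means CXL requests are not throttled
    step: float = 0.1         # additive increase while DDR is healthy
    backoff: float = 0.5      # multiplicative decrease while DDR is delayed
    rate: float = 1.0         # current fraction of CXL requests admitted
    estimate_ns: float = field(init=False)  # smoothed DDR service time

    def __post_init__(self) -> None:
        # Start from the baseline so the first sample is not over-weighted.
        self.estimate_ns = self.baseline_ns

    def update(self, sample_ns: float) -> float:
        """Fold in one DDR service-time sample and return the new CXL rate."""
        self.estimate_ns = (
            self.alpha * sample_ns + (1 - self.alpha) * self.estimate_ns
        )
        if self.estimate_ns > self.tolerance * self.baseline_ns:
            # DDR requests look delayed: throttle CXL, but never below the floor.
            self.rate = max(self.min_rate, self.rate * self.backoff)
        else:
            # DDR looks healthy: let CXL traffic recover gradually.
            self.rate = min(self.max_rate, self.rate + self.step)
        return self.rate


if __name__ == "__main__":
    controller = CxlRateController(baseline_ns=100.0)
    for sample in [100, 300, 300, 100, 100, 100]:
        print(sample, controller.update(sample))
```

Keeping a nonzero floor on the rate reflects the best-effort treatment of CXL requests described in the abstract; how a real system would measure service time and enforce the rate is outside the scope of this sketch.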

Paper Structure

This paper contains 15 sections, 1 equation, 14 figures, 1 table.

Figures (14)

  • Figure 1: Application utilizes tiered memory through three memory management schemes.
  • Figure 2: State-of-the-art tiered memory management schemes fall short of achieving the expected combined bandwidth of DDR and CXL.
  • Figure 3: Comparison of DDR and CXL memory in single-threaded and peak bandwidth on platform A and platform B.
  • Figure 4: The comparison of DDR and CXL memory latency.
  • Figure 5: Significant bandwidth loss due to concurrent handling of DDR and CXL memory requests. The red dotted lines indicate the maximally achievable DDR bandwidth in the absence of CXL memory traffic.
  • ...and 9 more figures