Performance Characterizations and Usage Guidelines of Samsung CXL Memory Module Hybrid Prototype

Jianping Zeng, Shuyi Pei, Da Zhang, Yuchen Zhou, Amir Beygi, Xuebin Yao, Ramdas Kachare, Tong Zhang, Zongwang Li, Marie Nguyen, Rekha Pitchumani, Yang Soek Ki, Changhee Jung

TL;DR

The paper tackles the memory capacity and persistence gap in data-intensive workloads by evaluating Samsung's CMM-H, a CXL-based memory module that blends a DRAM cache with NAND flash. Through extensive microbenchmarks and workloads spanning volatile and persistent scenarios, it characterizes latency, tail latency, bandwidth, and real-world performance when CMM-H is used as volatile memory, a memory expander, or persistent memory. A key contribution is the demonstration that CMM-H can deliver near-DRAM performance for cache-friendly, limited-footprint workloads, and substantial persistence-driven gains for durable services when it is combined with Global Persistent Flush and idempotent processing to avoid heavy write-ahead logging (WAL). The findings offer actionable guidance on workload placement and programming models to exploit CMM-H’s cost-effective memory expansion while balancing latency, bandwidth, and persistence requirements in modern datacenters.

Abstract

The growing prevalence of data-intensive workloads, such as artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), in-memory databases, and real-time analytics, has exposed limitations in conventional memory technologies like DRAM. While DRAM offers low latency and high throughput, it is constrained by high cost, scalability challenges, and volatility, making it less viable for capacity-bound and persistent applications in modern datacenters. Recently, Compute Express Link (CXL) has emerged as a promising alternative, enabling high-speed, cacheline-granular communication between CPUs and external devices. By leveraging CXL technology, NAND flash can now be used as memory expansion, offering three benefits: byte-addressability, scalable capacity, and persistence at a low cost. Samsung's CXL Memory Module Hybrid (CMM-H) is the first product to deliver these benefits through a hardware-only solution, i.e., unlike conventional block devices, it does not incur any OS or I/O overhead. In particular, CMM-H integrates a DRAM cache with NAND flash in a single device to deliver near-DRAM latency. This paper presents the first publicly available comprehensive characterization of an FPGA-based CMM-H prototype. Through this study, we address users' concerns about whether a wide variety of applications can successfully run on a memory device backed by a NAND flash medium. Additionally, based on these characterizations, we provide key insights into how to best take advantage of the CMM-H device.

Paper Structure

This paper contains 27 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: High-level architecture of Samsung CMM-H; assume it is connected to conventional CPUs, though it is technically possible to attach CMM-H to other accelerators (e.g., GPUs) in the future
  • Figure 2: Inconsistent program states for singly-linked list insertion across power failure; CMM-H device here functions as persistent memory
  • Figure 3: Normalized random access latencies of DDR5-R and CMM-H to those of DDR5-L (local DRAM); lower is better
  • Figure 4: Tail latency in microseconds of reads for DDR5-L across memory region sizes
  • Figure 5: Tail latency in microseconds of reads for CMM-H across memory region sizes
  • ...and 7 more figures