PIM or CXL-PIM? Understanding Architectural Trade-offs Through Large-Scale Benchmarking

I-Ting Lee, Bao-Kai Wang, Liang-Chi Chen, Wen Sheng Lim, Da-Wei Chang, Yu-Ming Chang, Chieng-Chung Ho

TL;DR

This paper addresses the data movement bottleneck in processing-in-memory (PIM) systems by comparing conventional DIMM-based PIM with CXL-PIM through large-scale measurements on real hardware and trace-driven modeling. It shows that end-to-end PIM performance is often dominated by host–PIM transfers due to disjoint address spaces, while CXL-PIM eliminates explicit staging at the cost of higher per-access latency. Through systematic benchmarking, the authors identify workload regimes where unified-address access yields meaningful benefits and where traditional PIM remains advantageous, and they highlight opportunities such as pipelined execution and device-assisted data management to improve CXL-PIM scalability. The findings provide practical guidance for near-memory system design and emphasize that choices between PIM and CXL-PIM are workload- and data-volume-dependent.

Abstract

Processing-in-memory (PIM) reduces data movement by executing near memory, but our large-scale characterization on real PIM hardware shows that end-to-end performance is often limited by disjoint host and device address spaces that force explicit staging transfers. In contrast, CXL-PIM provides a unified address space and cache-coherent access at the cost of higher access latency. These opposing interface models create workload-dependent tradeoffs that are not captured by small-scale studies. This work presents a side-by-side, large-scale comparison of PIM and CXL-PIM using measurements from real PIM hardware and trace-driven CXL modeling. We identify when unified-address access amortizes link latency enough to overcome transfer bottlenecks, and when tightly coupled PIM remains preferable. Our results reveal phase- and dataset-size regimes in which the relative ranking between the two architectures reverses, offering practical guidance for future near-memory system design.
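
The tradeoff described above can be illustrated with a toy cost model. This is a minimal sketch and not the authors' model: the function names, the parameters, and the assumption that kernel time is identical on both designs are all hypothetical, and it ignores overlap between transfers and compute.

```python
# Toy cost model. Every parameter is a hypothetical placeholder,
# not a value measured in the paper.

def pim_time(staged_bytes, kernel_s, staging_bw):
    """Traditional PIM: explicit host<->PIM staging plus kernel time (seconds)."""
    return staged_bytes / staging_bw + kernel_s


def cxl_pim_time(kernel_s, n_accesses, extra_latency_s):
    """CXL-PIM: no staging, but each access pays an added link latency (seconds)."""
    return kernel_s + n_accesses * extra_latency_s


def preferred(staged_bytes, kernel_s, staging_bw, n_accesses, extra_latency_s):
    """Return which design this toy model favors for the given inputs."""
    t_pim = pim_time(staged_bytes, kernel_s, staging_bw)
    t_cxl = cxl_pim_time(kernel_s, n_accesses, extra_latency_s)
    return "CXL-PIM" if t_cxl < t_pim else "PIM"


if __name__ == "__main__":
    # Same access count and latency penalty, different staged data volume.
    print(preferred(2e9, 0.1, 1e9, 1e6, 1e-6))  # large staged volume
    print(preferred(2e6, 0.1, 1e9, 1e6, 1e-6))  # small staged volume
```

In this sketch the ranking flips purely because the staging term grows with data volume while the latency term grows with access count, which is the kind of regime reversal the abstract refers to.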

Paper Structure

This paper contains 16 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Architecture and dataflow of traditional DIMM-based PIM and CXL-PIM. Traditional PIM requires explicit transfers between the host and PIM in both directions, while CXL-PIM provides unified memory access through CXL.mem.
  • Figure 2: Overall performance of the PIM system over large-scale workloads, showing that PIM fails to scale as dataset sizes grow.
  • Figure 3: Normalized data transfer time between the host and PIM devices. We fix the overall data size and scale the number of PUs from 1 to 512.
  • Figure 4: The ratio of data transfer time to the end-to-end execution time. We fix the overall data size, and scale the number of DPUs from 1 to 512.
  • Figure 5: Data transfer times of CXL-PIM with and without CXL-Assisted PU Management under the MLP workload.