PIM or CXL-PIM? Understanding Architectural Trade-offs Through Large-Scale Benchmarking

I-Ting Lee; Bao-Kai Wang; Liang-Chi Chen; Wen Sheng Lim; Da-Wei Chang; Yu-Ming Chang; Chieng-Chung Ho

TL;DR

This paper addresses the data movement bottleneck in processing-in-memory (PIM) systems by comparing conventional DIMM-based PIM with CXL-PIM through large-scale measurements on real hardware and trace-driven modeling. It shows that end-to-end PIM performance is often dominated by host–PIM transfers due to disjoint address spaces, while CXL-PIM eliminates explicit staging at the cost of higher per-access latency. Through systematic benchmarking, the authors identify workload regimes where unified-address access yields meaningful benefits and where traditional PIM remains advantageous, and they highlight opportunities such as pipelined execution and device-assisted data management to improve CXL-PIM scalability. The findings provide practical guidance for near-memory system design and emphasize that choices between PIM and CXL-PIM are workload- and data-volume-dependent.

Abstract

Processing-in-memory (PIM) reduces data movement by executing near memory, but our large-scale characterization on real PIM hardware shows that end-to-end performance is often limited by disjoint host and device address spaces that force explicit staging transfers. In contrast, CXL-PIM provides a unified address space and cache-coherent access at the cost of higher access latency. These opposing interface models create workload-dependent tradeoffs that are not captured by small-scale studies. This work presents a side-by-side, large-scale comparison of PIM and CXL-PIM using measurements from real PIM hardware and trace-driven CXL modeling. We identify when unified-address access amortizes link latency enough to overcome transfer bottlenecks, and when tightly coupled PIM remains preferable. Our results reveal phase- and dataset-size regimes in which the relative ranking between the two architectures reverses, offering practical guidance for future near-memory system design.
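The trade-off described above can be illustrated with a toy cost model. This is a minimal sketch and not the authors' methodology: it assumes DIMM-based PIM pays a one-time staging cost proportional to the bytes copied between host and device, while CXL-PIM skips staging but pays an extra latency on every access that crosses the link. All names (`Workload`, `pim_time`, `cxl_pim_time`, `preferred`) and all numeric values are hypothetical and chosen only to show how the ranking can flip.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    bytes_staged: float    # bytes DIMM-PIM must copy between host and device
    link_accesses: float   # accesses that would cross the CXL link
    kernel_seconds: float  # near-memory compute time, assumed equal on both


def pim_time(w: Workload, transfer_bw: float) -> float:
    """End-to-end time with explicit staging (assumed bandwidth-bound)."""
    return w.bytes_staged / transfer_bw + w.kernel_seconds


def cxl_pim_time(w: Workload, link_latency: float) -> float:
    """End-to-end time with unified addressing: no staging, but each
    link-crossing access pays an assumed fixed, unoverlapped latency."""
    return w.link_accesses * link_latency + w.kernel_seconds


def preferred(w: Workload, transfer_bw: float, link_latency: float) -> str:
    """Return the architecture with the lower modeled time."""
    if cxl_pim_time(w, link_latency) < pim_time(w, transfer_bw):
        return "CXL-PIM"
    return "PIM"


# Hypothetical parameters, for illustration only (not measured values).
TRANSFER_BW = 1e10    # bytes per second for host<->PIM staging
LINK_LATENCY = 5e-7   # extra seconds per link-crossing access

# Large staged input, few link-crossing accesses: staging dominates.
sparse_touch = Workload(bytes_staged=1e9, link_accesses=1e4, kernel_seconds=0.2)
# Same input, but many unoverlapped link-crossing accesses: latency dominates.
dense_touch = Workload(bytes_staged=1e9, link_accesses=1e9, kernel_seconds=0.2)

if __name__ == "__main__":
    for name, w in [("sparse_touch", sparse_touch), ("dense_touch", dense_touch)]:
        print(name, preferred(w, TRANSFER_BW, LINK_LATENCY))
```

Under these assumptions, the break-even point is where `link_accesses * link_latency` equals `bytes_staged / transfer_bw`. A real system would also need to account for overlap, caching, and pipelining, which this sketch ignores.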
