PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices
Si Ung Noh, Junguk Hong, Chaemin Lim, Seongyeon Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee
TL;DR
The paper tackles the data movement bottleneck in processing-in-memory DIMMs by designing PID-Comm, a fast and flexible inter-PE collective-communication framework. It introduces a virtual hypercube model that lets developers express multi-instance communications across arbitrary dimensions, paired with a high-performance library that applies PE-assisted reordering, in-register modulation, and cross-domain modulation to minimize host involvement. On real UPMEM hardware with 1024 PEs, PID-Comm delivers up to $5.19\times$ throughput improvements for primitives and up to $4.07\times$ speedups over CPU-only baselines across representative workloads (DLRM, GNN, BFS, CC, MLP), demonstrating significant practical impact for PIM-enabled DIMMs. The work also provides a programming framework, evaluation methodology, and open-source release, enabling broader adoption and extension to other PIM technologies.
Abstract
Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory (PIM) by associating their memory banks with processing elements (PEs), allowing applications to overcome the data movement bottleneck by offloading memory-intensive operations to the PEs. Many highly parallel applications have been shown to benefit from these PIM-enabled DIMMs, but further speedup is often limited by the large overhead of inter-PE communication. This overhead comes mainly from slow CPU-mediated inter-PE communication, which makes it difficult for PIM-enabled DIMMs to accelerate a wider range of applications. Prior studies have tried to alleviate the communication bottleneck, but they lack the flexibility and performance needed for a wide range of applications. In this paper, we present PID-Comm, a fast and flexible collective inter-PE communication framework for commodity PIM-enabled DIMMs. The key idea of PID-Comm is to abstract the PEs as a multi-dimensional hypercube and allow multiple instances of collective inter-PE communication between the PEs belonging to certain dimensions of the hypercube. Leveraging this abstraction, PID-Comm first defines eight collective inter-PE communication patterns that allow applications to easily express their complex communication patterns. Then, PID-Comm provides high-performance implementations of these patterns optimized for the DIMMs. Our evaluation using 16 UPMEM DIMMs and representative parallel algorithms shows that PID-Comm improves performance by up to 4.20x compared to existing inter-PE communication implementations. The implementation of PID-Comm is available at https://github.com/AIS-SNU/PID-Comm.
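
To make the hypercube abstraction concrete, the following is a minimal sketch in plain Python of how PEs laid out on a multi-dimensional grid can be partitioned into independent communication instances along chosen dimensions. It is not PID-Comm's API: the function names (`pe_coords`, `comm_groups`, `all_reduce`), the example shape, and the choice that the first dimension varies fastest are all assumptions made for illustration. A generic sum all-reduce stands in as a familiar collective, simulated on the host rather than executed on PEs.

```python
def pe_coords(pe_id, shape):
    """Convert a linear PE id to hypercube coordinates.

    Assumption: the first dimension varies fastest.
    """
    coords = []
    for extent in shape:
        coords.append(pe_id % extent)
        pe_id //= extent
    return tuple(coords)


def comm_groups(shape, comm_dims):
    """Partition PE ids into communication instances.

    PEs in the same group differ only in the dimensions listed in
    comm_dims and share their coordinates in every other dimension.
    """
    total = 1
    for extent in shape:
        total *= extent
    groups = {}
    for pe_id in range(total):
        coords = pe_coords(pe_id, shape)
        # The coordinates outside comm_dims identify the instance.
        key = tuple(c for d, c in enumerate(coords) if d not in comm_dims)
        groups.setdefault(key, []).append(pe_id)
    return list(groups.values())


def all_reduce(values, groups):
    """Simulate a sum all-reduce: every PE receives its group's sum."""
    result = list(values)
    for group in groups:
        group_sum = sum(values[pe] for pe in group)
        for pe in group:
            result[pe] = group_sum
    return result


# Hypothetical 4 x 2 x 2 hypercube of 16 PEs.
shape = (4, 2, 2)
along_dim0 = comm_groups(shape, {0})  # 4 instances of 4 PEs each
along_dim1 = comm_groups(shape, {1})  # 8 instances of 2 PEs each
reduced = all_reduce(list(range(16)), along_dim0)
```

In this sketch, choosing a different set of dimensions changes only the grouping, not the collective itself, which is the flexibility the hypercube model is meant to expose.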
