PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices

Si Ung Noh, Junguk Hong, Chaemin Lim, Seongyeon Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee

TL;DR

The paper tackles the data movement bottleneck in processing-in-memory DIMMs by designing PID-Comm, a fast and flexible inter-PE collective-communication framework. It introduces a virtual hypercube model that lets developers express multi-instance communications across arbitrary dimensions, paired with a high-performance library that applies PE-assisted reordering, in-register modulation, and cross-domain modulation to minimize host involvement. On real UPMEM hardware with 1024 PEs, PID-Comm delivers up to $5.19\times$ throughput improvements for primitives and up to $4.07\times$ speedups over CPU-only baselines across representative workloads (DLRM, GNN, BFS, CC, MLP), demonstrating significant practical impact for PIM-enabled DIMMs. The work also provides a programming framework, evaluation methodology, and open-source release, enabling broader adoption and extension to other PIM technologies.

Abstract

Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory (PIM) by associating their memory banks with processing elements (PEs), allowing applications to overcome the data movement bottleneck by offloading memory-intensive operations to the PEs. Many highly parallel applications have been shown to benefit from these PIM-enabled DIMMs, but further speedup is often limited by the huge overhead of inter-PE communication. This overhead mainly comes from slow CPU-mediated inter-PE communication methods, making it difficult for PIM-enabled DIMMs to accelerate a wider range of applications. Prior studies have tried to alleviate the communication bottleneck, but they lack the flexibility and performance needed for a wide range of applications. In this paper, we present PID-Comm, a fast and flexible collective inter-PE communication framework for commodity PIM-enabled DIMMs. The key idea of PID-Comm is to abstract the PEs as a multi-dimensional hypercube and allow multiple instances of collective inter-PE communication between the PEs belonging to certain dimensions of the hypercube. Leveraging this abstraction, PID-Comm first defines eight collective inter-PE communication patterns that allow applications to easily express their complex communication patterns. Then, PID-Comm provides high-performance implementations of the collective inter-PE communication patterns optimized for the DIMMs. Our evaluation using 16 UPMEM DIMMs and representative parallel algorithms shows that PID-Comm improves performance by up to $4.20\times$ compared to the existing inter-PE communication implementations. The implementation of PID-Comm is available at https://github.com/AIS-SNU/PID-Comm.
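
The hypercube abstraction described in the abstract can be illustrated with a minimal Python sketch. It is not PID-Comm's API: the function names (`pe_coords`, `comm_groups`, `all_reduce`), the coordinate ordering (first axis varies fastest), and the example shape are all assumptions made for illustration. The sketch shows only how choosing a set of hypercube dimensions splits the PEs into several independent communication instances, and what an all-reduce over those instances would compute.

```python
def pe_coords(pe_id, shape):
    """Map a linear PE id to hypercube coordinates.

    Assumption: the first axis varies fastest. This ordering is
    hypothetical and not taken from the paper.
    """
    coords = []
    for size in shape:
        coords.append(pe_id % size)
        pe_id //= size
    return tuple(coords)


def comm_groups(shape, comm_axes):
    """Partition PEs into groups whose members differ only along comm_axes.

    Each group corresponds to one instance of a collective communication.
    """
    num_pes = 1
    for size in shape:
        num_pes *= size
    groups = {}
    for pe in range(num_pes):
        coords = pe_coords(pe, shape)
        # PEs sharing all coordinates outside comm_axes fall into one group.
        key = tuple(c for axis, c in enumerate(coords) if axis not in comm_axes)
        groups.setdefault(key, []).append(pe)
    return list(groups.values())


def all_reduce(values, groups):
    """Reference semantics only: each PE ends with the sum over its group."""
    out = list(values)
    for group in groups:
        total = sum(values[pe] for pe in group)
        for pe in group:
            out[pe] = total
    return out


if __name__ == "__main__":
    shape = (4, 2, 2)                      # 16 PEs as a hypothetical 4x2x2 hypercube
    along_x = comm_groups(shape, {0})      # communicate along axis 0 only
    along_yz = comm_groups(shape, {1, 2})  # communicate along axes 1 and 2
    print(along_x)
    print(along_yz)
    print(all_reduce(list(range(16)), along_x))
```

With this shape, selecting axis 0 yields four groups of four PEs with consecutive ids, while selecting axes 1 and 2 yields four groups of four PEs strided by 4. The sketch models only the grouping and the result of the collective; it says nothing about how data physically moves between PEs through the host, which is where the library's optimizations apply.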

Paper Structure

This paper contains 51 sections, 24 figures, 3 tables, and 1 algorithm.

Figures (24)

  • Figure 1: Internal architecture of commodity PIM-enabled DIMMs.
  • Figure 2: Illustrations of eight representative collective communication primitives among four nodes.
  • Figure 3: Communication flow of prior work and PID-Comm.
  • Figure 4: Execution time breakdown of applications on PIM-enabled DIMMs.
  • Figure 5: Virtual hypercube and multi-axis communication topology.
  • ...and 19 more figures