Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

Bongjoon Hyun, Taehun Kim, Dongjae Lee, Minsoo Rhu

TL;DR

This work takes a deep dive into UPMEM's commercial PIM technology, a highly programmable, general-purpose PIM-enabled parallel computing architecture, and demystifies UPMEM's PIM design through a detailed characterization study.

Abstract

Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to its high design overheads and the lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market, ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we take a deep dive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) compiles UPMEM-PIM source code into machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on our characterization, we conduct a series of case studies to pathfind important architectural features that we deem critical for future PIM architectures to support.

Paper Structure

This paper contains 25 sections, 16 figures, 3 tables.

Figures (16)

  • Figure 1: UPMEM-PIM hardware system overview.
  • Figure 2: An element-wise vector addition program written for UPMEM-PIM: (a) host-side and (b) DPU-side program.
  • Figure 3: Memory model of (a) CUDA and (b) UPMEM-PIM. (c) The (physical) address map of UPMEM-PIM.
  • Figure 4: uPIMulator simulation framework overview.
  • Figure 5: PrIM's compute utilization (left axis) and memory read bandwidth utilization (right axis) when executing with $1$/$4$/$16$ threads. While a DPU's theoretical maximum DRAM bandwidth is $700$ MB/sec, prior work [prim] observed that the maximum bandwidth is around $600$ MB/sec on a real UPMEM-PIM system. We therefore configured uPIMulator's DRAM bandwidth accordingly. A single DPU's maximum compute throughput is set to $1$ IPC, and compute utilization is the percentage of this maximum IPC that is achieved.
  • ...and 11 more figures
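
The two utilization metrics described in the Figure 5 caption can be illustrated with a minimal sketch. The function names and inputs below are hypothetical and are not part of uPIMulator. The sketch also assumes that read bandwidth utilization is normalized to the 600 MB/sec figure the simulator is configured with, and that 1 MB is 10^6 bytes; the caption does not state either point explicitly.

```python
# Constants taken from the Figure 5 caption.
MAX_IPC = 1.0                # a single DPU's maximum compute throughput
PEAK_READ_BW_MB_S = 600.0    # configured DRAM bandwidth (assumed to be the normalizer)


def compute_utilization(instructions: int, cycles: int) -> float:
    """Return the percentage of the maximum IPC that was achieved."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    ipc = instructions / cycles
    return 100.0 * ipc / MAX_IPC


def read_bandwidth_utilization(bytes_read: int, seconds: float) -> float:
    """Return achieved read bandwidth as a percentage of the configured peak.

    Assumes 1 MB = 10**6 bytes (hypothetical convention for this sketch).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    mb_per_s = bytes_read / 1e6 / seconds
    return 100.0 * mb_per_s / PEAK_READ_BW_MB_S


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the formulas.
    print(compute_utilization(instructions=500, cycles=1000))
    print(read_bandwidth_utilization(bytes_read=300_000_000, seconds=1.0))
```

Under these assumptions, a DPU that retires one instruction every two cycles has 50% compute utilization, and a DPU that reads 300 MB in one second has 50% read bandwidth utilization.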