Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

Bongjoon Hyun; Taehun Kim; Dongjae Lee; Minsoo Rhu

TL;DR

This work takes a deep dive into UPMEM's commercial PIM technology, a highly programmable, general-purpose PIM-enabled parallel computing architecture, and demystifies UPMEM's PIM design through a detailed characterization study.

Abstract

Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to its high design overheads and the lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market, ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we take a deep dive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) compiles UPMEM-PIM source code into machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on top of our characterization, we conduct a series of case studies to pathfind important architectural features that we deem critical for future PIM architectures to support.

Paper Structure (25 sections, 16 figures, 3 tables)

Introduction
UPMEM-PIM Architecture
Hardware Architecture
Programming Model
System Software for Memory Management
uPIMulator Simulation Framework
Simulator Development
Simulator Availability and Extensibility
Simulator Validation
Simulation Rate
Demystifying UPMEM-PIM with uPIMulator
Analyzing Runtime Performance
Identifying Bottlenecks
Strong Scaling with Multi-DPUs
Pathfinding Future PIM Architectures
...and 10 more sections

Figures (16)

Figure 1: UPMEM-PIM hardware system overview.
Figure 2: An element-wise vector addition program written for UPMEM-PIM: (a) host-side and (b) DPU-side program.
Figure 3: Memory model of (a) CUDA and (b) UPMEM-PIM. (c) The (physical) address map of UPMEM-PIM.
Figure 4: uPIMulator simulation framework overview.
Figure 5: PrIM's compute utilization (left axis) and memory read bandwidth utilization (right axis) when executing with $1$/$4$/$16$ threads. While a DPU's theoretical maximum DRAM bandwidth is $700$ MB/sec, prior work (PrIM) observed that the maximum bandwidth is around $600$ MB/sec in a real UPMEM-PIM system. We therefore configured uPIMulator's DRAM bandwidth accordingly. A single DPU's maximum compute throughput is set to $1$ IPC, and compute utilization is the percentage of this maximum IPC achieved.
...and 11 more figures
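
The two utilization metrics defined in the Figure 5 caption can be illustrated with a short sketch. This is not code from uPIMulator: the function names, the counter inputs, and the choice of 1 MB = 10^6 bytes are assumptions made for illustration. Only the two ceilings (1 IPC and roughly 600 MB/sec) come from the caption.

```python
# Ceilings taken from the Figure 5 caption.
PEAK_IPC = 1.0               # maximum compute throughput of a single DPU
PEAK_READ_BW_MB_S = 600.0    # DRAM read bandwidth the simulator was configured with

BYTES_PER_MB = 1_000_000     # assumption: decimal megabytes


def compute_utilization(instructions: int, cycles: int) -> float:
    """Return achieved IPC as a percentage of the peak IPC."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    achieved_ipc = instructions / cycles
    return 100.0 * achieved_ipc / PEAK_IPC


def read_bandwidth_utilization(bytes_read: int, seconds: float) -> float:
    """Return achieved read bandwidth as a percentage of the configured peak."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    achieved_mb_s = bytes_read / BYTES_PER_MB / seconds
    return 100.0 * achieved_mb_s / PEAK_READ_BW_MB_S


if __name__ == "__main__":
    # Hypothetical counters, not measurements from the paper.
    print(compute_utilization(instructions=500, cycles=1000))
    print(read_bandwidth_utilization(bytes_read=300_000_000, seconds=1.0))
```

Under these definitions, a kernel that sits near 100% on the compute axis and low on the bandwidth axis would be limited by the DPU pipeline, while the opposite pattern would point to DRAM bandwidth as the limiter.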
