Shared-PIM: Enabling Concurrent Computation and Data Flow for Faster Processing-in-DRAM

Ahmed Mamdouh, Haoran Geng, Michael Niemier, Xiaobo Sharon Hu, Dayane Reis

TL;DR

Shared-PIM introduces bank-level concurrency in DRAM by adding a BK-bus, BK-SAs, and Shared Rows, which enable simultaneous computation and data movement within a memory bank. Compared to LISA, the architecture achieves ~5× lower inter-subarray data-transfer latency and ~1.2× lower energy. When integrated with pLUTo, it attains ~1.4× faster addition and multiplication, which translates into gains of 40%, 44%, and 31% for MM, PMM, and NTT, respectively, and gains of around 29% for BFS/DFS, at an area overhead of 7.16% over the baseline pLUTo. The approach supports pipelining and broadcasting, improving data flow for diverse workloads, including graph traversal and number theoretic transforms, while remaining compatible with existing PIM designs. Overall, Shared-PIM boosts in-DRAM processing throughput and efficiency with modest architectural overhead.

Abstract

Processing-in-Memory (PIM) enhances memory with computational capabilities, potentially solving energy and latency issues associated with data transfer between memory and processors. However, managing concurrent computation and data flow within the PIM architecture incurs significant latency and energy penalties for applications. This paper introduces Shared-PIM, an architecture for in-DRAM PIM that strategically allocates rows in memory banks, bolstered by memory peripherals, for concurrent processing and data movement. Shared-PIM enables simultaneous computation and data transfer within a memory bank. When compared to LISA, a state-of-the-art architecture that facilitates data transfers for in-DRAM PIM, Shared-PIM reduces data movement latency and energy by 5x and 1.2x, respectively. Furthermore, when integrated with a state-of-the-art (SOTA) in-DRAM PIM architecture (pLUTo), Shared-PIM achieves 1.4x faster addition and multiplication, and thereby improves the performance of matrix multiplication (MM) tasks by 40%, polynomial multiplication (PMM) by 44%, and number theoretic transform (NTT) tasks by 31%. Moreover, for graph processing tasks like Breadth-First Search (BFS) and Depth-First Search (DFS), Shared-PIM achieves a 29% improvement in speed, all with an area overhead of just 7.16% compared to the baseline pLUTo.
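To illustrate why overlapping computation with data movement can help, the following is a minimal sketch of a generic two-stage pipeline latency model. It is not the paper's evaluation methodology: the function names, the uniform per-step timings, and the arbitrary time units are all assumptions made for illustration, and the numbers are not taken from the paper.

```python
def serialized_latency(n_steps, t_compute, t_move):
    """Total time when each step computes and then moves its result,
    with no overlap between computation and data movement.
    (Hypothetical model; units are arbitrary.)"""
    return n_steps * (t_compute + t_move)


def overlapped_latency(n_steps, t_compute, t_move):
    """Total time for an idealized two-stage pipeline in which the data
    movement of step i overlaps with the computation of step i + 1.
    The first result is available after t_compute + t_move; every later
    step adds only the slower of the two stages.
    (Hypothetical model; ignores any contention or setup cost.)"""
    if n_steps == 0:
        return 0
    return t_compute + t_move + (n_steps - 1) * max(t_compute, t_move)


if __name__ == "__main__":
    # Illustrative values only, not measurements from the paper.
    steps, compute, move = 4, 10, 5
    print("serialized:", serialized_latency(steps, compute, move))
    print("overlapped:", overlapped_latency(steps, compute, move))
```

Under this idealized model the benefit grows with the number of steps and is bounded by the slower of the two stages, so overlap alone cannot yield more than a 2x improvement when both stages take equal time.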

Paper Structure (23 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: The phases of a read operation in a DRAM cell. (Red lines indicate an activated state.)
  • Figure 2: The Shared-PIM architecture: (a) The DRAM rank organization, (b) a tile of $512\times512$ DRAM cells, and (c) a single DRAM cell with an additional transistor, which forms a shared cell. (All parts highlighted in red in this figure are part of our proposed Shared-PIM architecture).
  • Figure 3: Comparison of the inter-subarray copy mechanism of LISA [lisa] versus Shared-PIM. Cells are represented by circles, and the bitlines/BK-bus of Shared-PIM are represented by lines. The open-bitline structure [takahashi2001multigigabit] is employed for both LISA and Shared-PIM. In this structure, two neighbouring subarrays share a common sense amplifier (SA).
  • Figure 4: (a) Pipeline example using Shared-PIM for an NTT butterfly computation. (b) Pipeline example using Shared-PIM for a matrix multiplication (commands not drawn to scale).
  • Figure 5: A SPICE simulation of Shared-PIM transferring data from a source row to 4 destination rows in different subarrays.
  • ...and 4 more figures