Table of Contents
Fetching ...

Generalized Ping-Pong: Off-Chip Memory Bandwidth Centric Pipelining Strategy for Processing-In-Memory Accelerators

Ruibao Wang, Bonan Yan

TL;DR

This work quantitatively analyze and compares the generalized ping-pong with the conventional scheduling strategies of naive ping-pong and in-situ write/compute for PIM, and shows that the generalized ping-pong strategy achieves acceleration of over 1.67 times when fully utilizing the off-chip memory bandwidth.

Abstract

Processing-in-memory (PIM) is a promising choice for accelerating deep neural networks (DNNs) featuring high efficiency and low power. However, the rapid upscaling of neural network model sizes poses a crucial challenge for the limited on-chip PIM capacity. When the PIM presumption of "pre-loading DNN weights/parameters only once before repetitive computing" is no longer practical, concurrent writing and computing techniques become necessary for PIM. Conventional methods of naive ping-pong or in~situ concurrent write/compute scheduling for PIM cause low utilization of off-chip memory bandwidth, subsequently offsetting the efficiency gain brought by PIM technology. To address this challenge, we propose an off-chip memory bandwidth centric pipelining strategy, named "generalized ping-pong", to maximize the utilization and performance of PIM accelerators toward large DNN models. The core idea of the proposed generalized ping-pong strategy is to evenly distribute the active time and fully utilize the off-chip memory bandwidth. Based on a programmable and scalable SRAM PIM architecture, we quantitatively analyze and compare the generalized ping-pong with the conventional scheduling strategies of naive ping-pong and in-situ write/compute for PIM. Experiments show that the generalized ping-pong strategy achieves acceleration of over 1.67 times when fully utilizing the off-chip memory bandwidth. When further limiting the off-chip memory bandwidth ranging in 8~256 bytes per clock cycle, the proposed generalized ping-pong strategy accelerates 1.22~7.71 times versus naive ping-pong. The developed PIM accelerator design with the generalized ping-poing strategy is open-sourced at https://github.com/rw999creator/gpp-pim.

Generalized Ping-Pong: Off-Chip Memory Bandwidth Centric Pipelining Strategy for Processing-In-Memory Accelerators

TL;DR

This work quantitatively analyze and compares the generalized ping-pong with the conventional scheduling strategies of naive ping-pong and in-situ write/compute for PIM, and shows that the generalized ping-pong strategy achieves acceleration of over 1.67 times when fully utilizing the off-chip memory bandwidth.

Abstract

Processing-in-memory (PIM) is a promising choice for accelerating deep neural networks (DNNs) featuring high efficiency and low power. However, the rapid upscaling of neural network model sizes poses a crucial challenge for the limited on-chip PIM capacity. When the PIM presumption of "pre-loading DNN weights/parameters only once before repetitive computing" is no longer practical, concurrent writing and computing techniques become necessary for PIM. Conventional methods of naive ping-pong or in~situ concurrent write/compute scheduling for PIM cause low utilization of off-chip memory bandwidth, subsequently offsetting the efficiency gain brought by PIM technology. To address this challenge, we propose an off-chip memory bandwidth centric pipelining strategy, named "generalized ping-pong", to maximize the utilization and performance of PIM accelerators toward large DNN models. The core idea of the proposed generalized ping-pong strategy is to evenly distribute the active time and fully utilize the off-chip memory bandwidth. Based on a programmable and scalable SRAM PIM architecture, we quantitatively analyze and compare the generalized ping-pong with the conventional scheduling strategies of naive ping-pong and in-situ write/compute for PIM. Experiments show that the generalized ping-pong strategy achieves acceleration of over 1.67 times when fully utilizing the off-chip memory bandwidth. When further limiting the off-chip memory bandwidth ranging in 8~256 bytes per clock cycle, the proposed generalized ping-pong strategy accelerates 1.22~7.71 times versus naive ping-pong. The developed PIM accelerator design with the generalized ping-poing strategy is open-sourced at https://github.com/rw999creator/gpp-pim.

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: PIM accelerators face emerging challenge in high-dimensional GeMM computation: realization of concurrent write/compute.
  • Figure 2: Typical SRAM-based PIM GeMM accelerator circuit block diagram and its two work modes.
  • Figure 3: (a) In situ write/compute strategy. (b) Naive ping-pong strategy. (c) Generalized ping-pong strategy (this work).
  • Figure 4: the specific idle time ratio of macro
  • Figure 5: The base PIM accelerator architecture as an example to implement generalized ping-pong scheduling strategy. This base architecture is revised from PUMA ankit2019puma to support various scheduling strategies.
  • ...and 2 more figures