Optimizing and Exploring System Performance in Compact Processing-in-Memory-based Chips

Peilin Chen, Xiaoxuan Yang

TL;DR

The paper tackles the challenge that area-constrained PIM chips cannot store full neural network weights on chip, which causes costly data movement and erodes the benefits of PIM. It introduces a data-movement-aware pipeline for compact PIM and a Dynamic Duplication Method (DDM) to mitigate pipeline bottlenecks, along with an analysis of how NN size trades off against performance. Experimental results show that the pipeline plus DDM deliver substantial gains, including a 2.35x throughput improvement over a non-DDM baseline, and up to 4.56x throughput and 157x energy efficiency gains compared with an RTX 4090 GPU, while using only one-third of the area of an area-unlimited PIM design and achieving about 16.2 GOPS/mm^2. The work also provides practical guidance on the maximum deployable NN size (between ResNet-50 and ResNet-101) for compact PIM, demonstrating that carefully engineered data movement and scheduling can markedly close the gap with area-unlimited designs and enable competitive CNN inference on constrained hardware.

Abstract

Processing-in-memory (PIM) is a promising computing paradigm to tackle the "memory wall" challenge. However, PIM system-level benefits over traditional von Neumann architecture can be reduced when the memory array cannot fully store all the neural network (NN) weights. The NN size is increasing while the PIM design size cannot scale up accordingly due to area constraints. Therefore, this work targets the system performance optimization and exploration for compact PIM designs. We first analyze the impact of data movement on compact designs. Then, we propose a novel pipeline method that maximizes the reuse of NN weights to improve the throughput and energy efficiency of inference in compact chips. To further boost throughput, we introduce a scheduling algorithm to mitigate the pipeline bubble problem. Moreover, we investigate the trade-off between the network size and system performance for a compact PIM chip. Experimental results show that the proposed algorithm achieves 2.35x and 0.5% improvement in throughput and energy efficiency, respectively. Compared to the area-unlimited design, our compact chip achieves approximately 56.5% of the throughput and 58.6% of the energy efficiency while using only one-third of the chip area, along with 1.3x improvement in area efficiency. Our compact design also outperforms the modern GPU with 4.56x higher throughput and 157x better energy efficiency. Besides, our compact design uses less than 20% of the system energy for data movement as batch size scales up.
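The weight-reuse idea in the abstract can be illustrated with a toy count of weight loads. This is only a sketch under stated assumptions, not the paper's pipeline: it assumes the network is split into partitions that are mapped onto the chip one at a time, and that each partition's weights are either kept in place for the whole batch or reloaded for every input. The function names and the partition and batch values are hypothetical.

```python
def weight_loads(num_partitions: int, batch_size: int, reuse_across_batch: bool) -> int:
    """Count how many times a weight partition is written to the chip for one batch.

    Toy model (assumption): each partition is loaded once if it is reused for
    every input in the batch, otherwise once per input.
    """
    if reuse_across_batch:
        return num_partitions
    return num_partitions * batch_size


def loads_per_inference(num_partitions: int, batch_size: int, reuse_across_batch: bool) -> float:
    """Amortized weight-partition loads per inference."""
    return weight_loads(num_partitions, batch_size, reuse_across_batch) / batch_size


if __name__ == "__main__":
    # Hypothetical setup: a network split into 2 partitions.
    for batch in (1, 8, 64):
        reused = loads_per_inference(2, batch, reuse_across_batch=True)
        naive = loads_per_inference(2, batch, reuse_across_batch=False)
        print(f"batch={batch}: with reuse={reused}, without reuse={naive}")
```

In this simplified model the amortized load count with reuse shrinks as the batch grows, which is in line with the abstract's observation that the data-movement share of system energy falls as batch size scales up. It says nothing about the actual energy figures, which depend on the hardware.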

Paper Structure

This paper contains 12 sections, 3 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Chip area required by ResNet NNs to store all weights on SRAM and RRAM array under 32nm process. Such designs are referred to as area-unlimited designs.
  • Figure 2: Overall workflow of our design. PE: Processing Engine. S: Subarray.
  • Figure 3: Normalized data transaction number for different batch sizes between PIM designs and LPDDR5.
  • Figure 4: Pipeline method in area-unlimited designs (case 1) vs. our pipeline method in compact PIM-based chips (case 2 and case 3). In this example, we assume that the NN has five CONV/FC layers (L1$\sim$L5).
  • Figure 5: Left: Five NN layers are divided into two parts to be mapped onto the accelerator. Right: Execution process of the NN. WB denotes "Write Back".
  • ...and 3 more figures