PIMfused: Near-Bank DRAM-PIM with Fused-layer Dataflow for CNN Data Transfer Optimization

Simei Yang, Xinyu Shi, Lu Zhao, Yunyu Ling, Quanjun Wang, Francky Catthoor

TL;DR

This work addresses cross-bank data-transfer bottlenecks in CNN acceleration on near-bank DRAM-PIM by introducing PIMfused, a hardware–software co-design that uses fused-layer dataflow to reduce inter-bank dependencies during end-to-end CNN execution. The architecture combines bank-level PIMcores for fused kernels with a channel-level GBcore, backed by local buffers (LBUFs) and a global buffer (GBUF), and coordinates computation and data movement through new PIM commands. The dataflow applies fused-layer processing to shallow layers and layer-by-layer processing to deeper layers, supporting end-to-end CNNs such as ResNet18. With 4-bank PIMcores, PIMfused cuts memory cycles to 30.6%, energy to 83.4%, and area to 76.5% of a GDDR6-AiM-like baseline.
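To make the fused-layer idea concrete, here is a minimal sketch that assumes the usual tile-based formulation of layer fusion: each PIMcore owns a spatial tile of the final output and computes it through several consecutive layers from a slightly larger input tile, so no intermediate feature maps have to come from other banks. The function name, the `(kernel, stride)` layer description, and the tile sizes are illustrative assumptions, not taken from the paper.

```python
def input_tile_size(out_tile, layers):
    """Return the input tile width needed to produce `out_tile` outputs.

    `layers` is a list of (kernel, stride) pairs ordered from the first
    fused layer to the last. Padding is ignored (assumption), so the
    result is an upper bound on the data a core must hold locally.
    """
    size = out_tile
    # Walk backwards from the last fused layer to the first one.
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


if __name__ == "__main__":
    # Hypothetical example: two fused 3x3 convolutions with stride 1.
    print(input_tile_size(4, [(3, 1), (3, 1)]))
```

The extra rows and columns around the tile are the price of fusion: neighbouring cores recompute or re-read the overlap instead of exchanging intermediate results across banks.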

Abstract

Near-bank Processing-in-Memory (PIM) architectures integrate processing cores (PIMcores) close to DRAM banks to mitigate the high cost of off-chip memory accesses. When accelerating convolutional neural networks (CNNs) on DRAM-PIM, performance is often constrained by cross-bank (or cross-PIMcore) data transfers, which are induced by the conventional layer-by-layer dataflow that enforces inter-bank (or inter-PIMcore) dependencies across successive CNN layers. To address this challenge, we propose PIMfused, a hardware-software co-design that enables fused-layer dataflow for end-to-end CNN execution in near-bank DRAM-PIM. By adopting fused-layer dataflow, PIMfused improves data reuse and, more importantly, breaks inter-bank data dependencies, thereby optimizing cross-bank data transfers without sacrificing bank-level parallelism. We study the impact of buffer sizes and PIMcore parallelism (1-bank vs. 4-bank) on PIMfused using end-to-end ResNet18. We present three key takeaways and show that with 4-bank PIMcores, PIMfused achieves overall power, performance, and area (PPA) gains over a GDDR6-AiM-like baseline, cutting memory cycles to 30.6%, energy to 83.4%, and area to 76.5%.
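The headline numbers are reported as values normalized to the baseline ("cut to 30.6%"), not as percentage reductions. The short sketch below converts them into both forms; only the three normalized values come from the text, while the function and key names are illustrative.

```python
def summarize(normalized):
    """Convert baseline-normalized metrics into reductions and factors.

    `normalized` maps a metric name to its value relative to the
    baseline (1.0 means equal to the baseline).
    """
    summary = {}
    for name, value in normalized.items():
        summary[name] = {
            # Percentage saved relative to the baseline.
            "reduction_pct": round((1.0 - value) * 100.0, 1),
            # How many times smaller the metric is than the baseline.
            "improvement_factor": round(1.0 / value, 2),
        }
    return summary


if __name__ == "__main__":
    # Normalized values quoted in the abstract (4-bank PIMcores).
    reported = {"memory_cycles": 0.306, "energy": 0.834, "area": 0.765}
    for metric, stats in summarize(reported).items():
        print(metric, stats)
```

Read this way, the memory-cycle result corresponds to roughly a 69% reduction, while the energy and area results correspond to reductions of about 17% and 24%.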

Paper Structure

This paper contains 15 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Dataflow comparison highlighting data transfers for PIMcore$_0$ and PIMcore$_3$: (a) Layer-by-layer dataflow. (b) Fused-layer dataflow. Fmaps (feature maps).
  • Figure 2: The PIMfused architecture within a memory channel.
  • Figure 3: (a) CNN graph example. (b) Layer-by-layer dataflow on PIMfused with POOL and ADD_RELU executed on GBcore. (c) PIMfused dataflow, storing intermediate data in local bank or LBUF in PIMcore. (*BK2GBUF: data transfer from bank to GBUF; *GBUF2BKF: data transfer from GBUF to bank.)
  • Figure 4: Overview of our profiling framework.
  • Figure 5: Normalized system PPA with increasing GBUF and no LBUF (w.r.t. AiM-like with G2K_L0).
  • ...and 2 more figures