3D-TrIM: A Memory-Efficient Spatial Computing Architecture for Convolution Workloads

Cristian Sestito, Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis

TL;DR

The paper tackles the energy-costly data movement in CNN accelerators (the Von Neumann bottleneck) by extending the TrIM approach to a 3D systolic architecture. It introduces shadow registers and a shared Input Recycling Buffer to maximize on-chip ifmap reuse and reduce memory accesses, forming a 3D arrangement of 576 PEs with shared adder trees for psum accumulation. The design achieves 1.15 TOPS at 1 GHz in 22 nm technology, with an area of 0.26 mm$^2$ and a power of 0.25 W, and delivers up to $3.37\times$ more operations per memory access than TrIM on networks like VGG-16 and AlexNet. Overall, 3D-TrIM shows that buffer sharing and shadow registers can substantially mitigate the memory bottleneck while maintaining high throughput and energy efficiency for convolution workloads.

Abstract

The Von Neumann bottleneck, which relates to the energy cost of moving data between memory and the on-chip core, is a serious challenge in state-of-the-art AI architectures such as Convolutional Neural Network (CNN) accelerators. Systolic arrays exploit distributed processing elements that exchange data with each other, thus mitigating the memory cost. However, when involved in convolutions, data redundancy must be carefully managed to avoid significant memory access overhead. To overcome this problem, TrIM has been recently proposed. It features a systolic array based on an innovative dataflow, where input feature map (ifmap) activations are locally reused through a triangular movement. However, ifmaps still suffer from memory access overhead. This work proposes 3D-TrIM, an upgraded version of TrIM that addresses the memory access overhead through a few extra shadow registers. In addition, due to a change in the architectural orientation, the local shift register buffers are now shared between different slices, thus improving area and energy efficiency. An architecture of 576 processing elements is implemented in a commercial 22 nm technology and achieves an area efficiency of 4.47 TOPS/mm$^2$ and an energy efficiency of 4.54 TOPS/W. Finally, 3D-TrIM outperforms TrIM by up to $3.37\times$ in terms of operations per memory access considering CNN topologies like VGG-16 and AlexNet.
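
The headline figures can be cross-checked with a little arithmetic. The sketch below assumes, as is conventional but not stated in this excerpt, that each PE completes one multiply-accumulate per cycle and that a multiply-accumulate counts as two operations. Under that assumption, 576 PEs at 1 GHz give about 1.15 TOPS, and the quoted efficiencies imply roughly 0.26 mm$^2$ and 0.25 W. The function names are illustrative only.

```python
# Back-of-envelope check of the headline figures.
# ASSUMPTION (not stated in the excerpt): one multiply-accumulate (MAC)
# per PE per cycle, counted as 2 operations (multiply + add).

NUM_PES = 576            # from the abstract
CLOCK_HZ = 1e9           # 1 GHz, from the TL;DR
OPS_PER_MAC = 2          # assumption
AREA_EFF = 4.47          # TOPS/mm^2, from the abstract
ENERGY_EFF = 4.54        # TOPS/W, from the abstract


def peak_tops(num_pes, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak throughput in tera-operations per second."""
    return num_pes * ops_per_mac * clock_hz / 1e12


def implied_denominator(tops, efficiency):
    """Area (mm^2) or power (W) implied by a throughput and an efficiency."""
    return tops / efficiency


tops = peak_tops(NUM_PES, CLOCK_HZ)
area_mm2 = implied_denominator(tops, AREA_EFF)
power_w = implied_denominator(tops, ENERGY_EFF)

print(f"peak throughput: {tops:.3f} TOPS")
print(f"implied area:    {area_mm2:.3f} mm^2")
print(f"implied power:   {power_w:.3f} W")
```

The implied area and power round to the 0.26 mm$^2$ and 0.25 W quoted in the TL;DR, which suggests the efficiency figures were computed from unrounded values.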

Paper Structure

This paper contains 7 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Memory access overhead in TrIM [Sestito_24_1]. The values relate to the processing of a single ifmap, with sizes as reported on the horizontal axis. The case of a $3 \times 3$ kernel is considered. Numbers are retrieved using the analytical model reported in [Sestito_24_1].
  • Figure 2: Top-level architecture of 3D-TrIM. $P_I$ cores execute convolutions between $P_I$ ifmaps and $P_I \times P_O$ kernels. Each core hosts $P_O$ slices, operating on the same ifmap. To maximize ifmap utilization on-chip, an Input Recycling Buffer is accommodated at the core level and shared among the slices. $P_O$ adder trees accumulate psums to finalize the convolution. A control logic supervises the functionality of the entire architecture over time.
  • Figure 3: The slice when $K=3$. (a) It consists of $3 \times 3$ Processing Elements (PE), interconnected with each other in vertical and horizontal directions. In addition, the slice interacts with the Input Recycling Buffer to provide and read back activations to be reused. The read back activity is finalized by diagonal connections. An adder tree eventually accumulates psums coming from bottom PEs. Red arrows relate to activations; black dashed arrows relate to weights; blue arrows relate to psums. (b) Each PE stores the current activation, weight, psum in registers. Two multiplexers select the direction of the activation to be reused (vertically, horizontally, diagonally). Finally, a pipelined multiply-accumulation unit executes the current computation.
  • Figure 4: The Input Recycling Buffer when $K=3$. It consists of two reconfigurable shift registers, shadow registers and multiplexers. Shift registers read activations from Slice 0 (in each core). After some cycles, these shift registers provide activations back to PEs for reuse. Shadow registers manage end-of-row ifmap activations. Multiplexers select whether the current activations must be provided by shift-registers or by shadow registers.
  • Figure 5: Example of dataflow. An $8 \times 8$ ifmap is considered. The computational cycles from 6 to 13 are visualized in detail. In the ifmap, the area in yellow indicates the portion managed by the shadow registers. For each cycle, the activations processed by the PEs, shift registers and shadow registers are reported. Activations in blue are read from the memory. Activations in orange are shifted through right-to-left movements. Activations in green relate to diagonal movements through shift registers. Activations in yellow relate to diagonal movements using shadow registers. Xs refer to don't care cases.
  • ...and 1 more figure
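
As a functional reference for what the architecture in Figure 2 computes, the following minimal sketch convolves $P_I$ ifmaps with $P_I \times P_O$ kernels and accumulates the per-core psums into $P_O$ outputs, which is the role the caption assigns to the adder trees. It models only the arithmetic result, not the triangular dataflow, the Input Recycling Buffer, or the shadow registers. Stride 1 and no padding are assumptions, and the names `conv2d_valid`, `multi_channel_conv` and `pe_count` are hypothetical.

```python
# Functional (not cycle-accurate) model of the computation in Figure 2.
# ASSUMPTIONS: stride 1, no padding, square K x K kernels.


def conv2d_valid(ifmap, kernel):
    """Slide one K x K kernel over one ifmap (what a single slice computes)."""
    h, w, k = len(ifmap), len(ifmap[0]), len(kernel)
    out = []
    for r in range(h - k + 1):
        row = []
        for c in range(w - k + 1):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += ifmap[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def multi_channel_conv(ifmaps, kernels):
    """ifmaps: list of P_I maps; kernels[i][o]: kernel for input i, output o.

    Returns P_O output maps. Output o is the element-wise sum over the
    P_I cores of the psums produced by slice o of each core.
    """
    p_i, p_o = len(ifmaps), len(kernels[0])
    outputs = []
    for o in range(p_o):
        psums = [conv2d_valid(ifmaps[i], kernels[i][o]) for i in range(p_i)]
        rows, cols = len(psums[0]), len(psums[0][0])
        outputs.append(
            [[sum(p[r][c] for p in psums) for c in range(cols)] for r in range(rows)]
        )
    return outputs


def pe_count(p_i, p_o, k):
    """PEs in P_I cores of P_O slices, each slice holding K x K PEs."""
    return p_i * p_o * k * k


# Tiny example: 2 ifmaps of 4 x 4 ones, 2 x 2 kernels of 3 x 3 ones.
ones_map = [[1] * 4 for _ in range(4)]
ones_kernel = [[1] * 3 for _ in range(3)]
example_ifmaps = [ones_map, ones_map]
example_kernels = [[ones_kernel, ones_kernel], [ones_kernel, ones_kernel]]
example_out = multi_channel_conv(example_ifmaps, example_kernels)
```

With $K=3$, a total of 576 PEs corresponds to $P_I \times P_O = 64$; `pe_count(8, 8, 3)` is one factorization that gives 576, although the captions shown here do not state how the 64 slices are split between $P_I$ and $P_O$.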