3D-TrIM: A Memory-Efficient Spatial Computing Architecture for Convolution Workloads
Cristian Sestito, Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis
TL;DR
The paper tackles the energy-costly data movement in CNN accelerators (the Von Neumann bottleneck) by extending the TrIM approach to a 3D systolic architecture. It introduces shadow registers and a shared Input Recycling Buffer to maximize on-chip ifmap reuse and reduce memory accesses, forming a 3D arrangement of 576 PEs with shared adder trees for psum accumulation. The design achieves 1.15 TOPS at 1 GHz in 22 nm technology with 0.26 mm^2 area and 0.25 W power, and delivers up to 3.37× more operations per memory access than TrIM on networks like VGG-16 and AlexNet. Overall, 3D-TrIM shows that buffer sharing and shadow registers can reduce memory traffic for convolution workloads while maintaining high throughput and energy efficiency.
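The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes that each PE completes one multiply-accumulate per cycle and that a multiply-accumulate counts as two operations; neither assumption is stated in this summary, and the variable names are illustrative only.

```python
# Assumption (not stated in the summary): every PE completes one
# multiply-accumulate per clock cycle, counted as 2 operations.
NUM_PES = 576
CLOCK_HZ = 1e9          # 1 GHz, as reported
OPS_PER_MAC = 2         # one multiply + one add

# Peak throughput in tera-operations per second.
peak_tops = NUM_PES * OPS_PER_MAC * CLOCK_HZ / 1e12

# Rounded area and power figures quoted in the TL;DR.
AREA_MM2 = 0.26
POWER_W = 0.25

area_eff = peak_tops / AREA_MM2     # TOPS/mm^2
energy_eff = peak_tops / POWER_W    # TOPS/W

print(f"peak throughput:   {peak_tops:.3f} TOPS")
print(f"area efficiency:   {area_eff:.2f} TOPS/mm^2")
print(f"energy efficiency: {energy_eff:.2f} TOPS/W")
```

Under these assumptions the peak throughput matches the reported 1.15 TOPS, and the two efficiency figures land within about 2% of the 4.47 TOPS/mm^2 and 4.54 TOPS/W given in the abstract. The small gap is presumably due to the rounded area and power values used here.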
Abstract
The Von Neumann bottleneck, which relates to the energy cost of moving data between memory and the on-chip core, is a serious challenge in state-of-the-art AI architectures such as Convolutional Neural Network (CNN) accelerators. Systolic arrays exploit distributed processing elements that exchange data with each other, thus mitigating the memory cost. However, when they execute convolutions, data redundancy must be carefully managed to avoid significant memory access overhead. To overcome this problem, TrIM has recently been proposed. It features a systolic array based on an innovative dataflow, where input feature map (ifmap) activations are locally reused through a triangular movement. However, ifmaps still suffer from memory access overhead. This work proposes 3D-TrIM, an upgraded version of TrIM that addresses the memory access overhead through a few extra shadow registers. In addition, due to a change in the architectural orientation, the local shift register buffers are now shared between different slices, thus improving area and energy efficiency. An architecture of 576 processing elements is implemented in a commercial 22 nm technology and achieves an area efficiency of 4.47 TOPS/mm$^2$ and an energy efficiency of 4.54 TOPS/W. Finally, 3D-TrIM outperforms TrIM by up to $3.37\times$ in terms of operations per memory access on CNN topologies like VGG-16 and AlexNet.
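To make the data-redundancy point concrete, the following sketch counts ifmap reads for a single-channel convolution in two extreme cases: fetching every sliding window from memory in full, and reading each activation exactly once with perfect on-chip reuse. It is a generic illustration, not a model of the TrIM or 3D-TrIM dataflow; the function name, the stride of 1, and the absence of padding are assumptions made here.

```python
def ifmap_reads(height: int, width: int, kernel: int) -> tuple[int, int]:
    """Count ifmap element reads for a stride-1, unpadded 2D convolution.

    Returns (naive, ideal):
      naive - every K x K window is fetched from memory in full
      ideal - each ifmap element is fetched once and reused on chip
    """
    out_h = height - kernel + 1
    out_w = width - kernel + 1
    naive = out_h * out_w * kernel * kernel
    ideal = height * width
    return naive, ideal


# Illustrative sizes: a 224 x 224 ifmap with a 3 x 3 kernel.
naive, ideal = ifmap_reads(224, 224, 3)
print(f"naive reads: {naive}")
print(f"ideal reads: {ideal}")
print(f"redundancy:  {naive / ideal:.2f}x")
```

For a K x K kernel the redundancy approaches K^2 as the ifmap grows, which is the overhead that local reuse schemes aim to remove.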
