L2R-CIPU: Efficient CNN Computation with Left-to-Right Composite Inner Product Units
Malik Zohaib Nisar, Mohammad Sohail Ibrahim, Muhammad Usman, Jeong-A Lee
TL;DR
This work tackles the throughput bottleneck of traditional right-to-left, bit-serial CNN accelerators by introducing a left-to-right composite inner-product unit (L2R-CIPU) that uses most-significant-digit-first (MSDF) LR arithmetic and an online reduction tree. The inner-product timing is modeled as $\delta_{IP} = n^{2} + \delta_{Mult}$, and the overall cycle count $Cycle_{P}$ additionally accounts for tile and kernel reductions; accumulation is parallelized with a 6:2 compressor and carry-save registers. Implemented as an 8×8 PE array processing 3×3 windows over 8 input channels, the design is evaluated on VGG-16, where it achieves up to 6.22× higher performance, 15× higher energy efficiency (TOPS/W), and 53.45× higher area efficiency (TOPS/mm²) than prior accelerators. These results suggest that LR-based inner-product computation can markedly improve hardware CNN throughput and energy efficiency for commodity deep learning workloads.
Abstract
This paper proposes a composite inner-product computation unit based on left-to-right (LR) arithmetic for accelerating convolutional neural networks (CNNs) in hardware. The efficacy of the proposed L2R-CIPU is demonstrated on the VGG-16 network and assessed across several performance metrics. The L2R-CIPU design achieves 1.06× to 6.22× higher performance, 4.8× to 15× higher TOPS/W, and 4.51× to 53.45× higher TOPS/mm² than prior architectures.
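To make the left-to-right idea concrete, the following is a minimal software sketch of a most-significant-bit-first, bit-serial inner product. It is an illustration under stated assumptions, not the L2R-CIPU datapath: it assumes unsigned activations in plain binary and integer weights, and it uses an ordinary Python sum where the hardware would use a reduction tree. The function name `l2r_inner_product` and the parameter `n_bits` are hypothetical. The 72-term example corresponds to one 3×3 window over 8 input channels.

```python
import random


def l2r_inner_product(weights, activations, n_bits=8):
    """Inner product computed one activation bit plane at a time, MSB first.

    Assumptions (illustrative only): activations are unsigned integers
    that fit in n_bits; weights are arbitrary Python integers.
    """
    acc = 0
    for j in range(n_bits - 1, -1, -1):  # most significant bit first
        # Partial sum for bit plane j: add every weight whose
        # activation has bit j set.
        plane = sum(w for w, x in zip(weights, activations) if (x >> j) & 1)
        # Horner step: double the running result, then add the new plane.
        acc = 2 * acc + plane
    return acc


# One 3x3 window over 8 input channels gives a 72-term inner product.
rng = random.Random(0)
weights = [rng.randint(-128, 127) for _ in range(3 * 3 * 8)]
activations = [rng.randint(0, 255) for _ in range(3 * 3 * 8)]

reference = sum(w * x for w, x in zip(weights, activations))
result = l2r_inner_product(weights, activations)
print(result == reference)
```

Because the most significant bit planes are consumed first, the running value `acc` is a progressively refined approximation of the final result after each step, which is the property that left-to-right arithmetic exploits.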
