L2R-CIPU: Efficient CNN Computation with Left-to-Right Composite Inner Product Units

Malik Zohaib Nisar; Mohammad Sohail Ibrahim; Muhammad Usman; Jeong-A Lee

TL;DR

This work tackles the throughput bottleneck of traditional right-to-left, bit-serial CNN accelerators by introducing a left-to-right composite inner-product unit (L2R-CIPU) built on most-significant-digit-first (MSDF) left-to-right (LR) arithmetic and an online reduction tree. Inner-product timing is modeled as $\delta_{IP} = n^{2} + \delta_{Mult}$, and the overall cycle count $Cycle_{P}$ accounts for tile and kernel reductions; partial results are accumulated in parallel through a 6:2 compressor and carry-save registers. Implemented as an 8×8 PE array processing 3×3 windows over 8 input channels, the design is evaluated on VGG-16, where it achieves up to 6.22× higher performance, up to 15× higher energy efficiency (TOPS/W), and up to 53.45× higher area efficiency (TOPS/mm²) than prior accelerators. These results suggest that LR-based inner-product computation can markedly improve hardware CNN throughput and energy efficiency for commodity deep learning workloads.
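To make the left-to-right idea concrete, the following is a minimal Python sketch of an inner product evaluated most-significant-bit first. It is only an illustration under simplifying assumptions: weights are plain unsigned binary integers rather than the MSDF digit representation the paper uses, and the function name `l2r_inner_product` and its parameters are hypothetical, not taken from the paper. It does not model the online reduction tree, the 6:2 compressor, or the cycle counts.

```python
def l2r_inner_product(xs, ws, n_bits=8):
    """Inner product sum(x * w), consuming the weights MSB first.

    Assumption (not from the paper): weights are unsigned integers that
    fit in n_bits. Returns the final sum and the running accumulator
    value after each bit position, most significant position first.
    """
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    if any(not 0 <= w < (1 << n_bits) for w in ws):
        raise ValueError("each weight must fit in n_bits unsigned bits")

    acc = 0
    partials = []
    for j in range(n_bits - 1, -1, -1):  # MSB down to LSB
        # Add up the inputs whose weight has bit j set (one bit "column").
        column = sum(x for x, w in zip(xs, ws) if (w >> j) & 1)
        # Horner-style update: shift previous result left, add new column.
        acc = 2 * acc + column
        partials.append(acc)
    return acc, partials


if __name__ == "__main__":
    total, partials = l2r_inner_product([3, 5, 2], [4, 1, 7], n_bits=3)
    print(total, partials)
```

Because the high-order bit columns are consumed first, each running value in `partials` is the inner product computed from the weight bits seen so far, which is the property that lets an MSDF design start producing the significant part of a result before all input digits have arrived.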

Abstract

This paper proposes a composite inner-product computation unit based on left-to-right (LR) arithmetic for accelerating convolutional neural networks (CNNs) in hardware. The efficacy of the proposed L2R-CIPU design is demonstrated on the VGG-16 network and assessed across several performance metrics. L2R-CIPU achieves 1.06× to 6.22× higher performance, 4.8× to 15× higher TOPS/W, and 4.51× to 53.45× higher TOPS/mm² than prior architectures.

Paper Structure (6 sections, 2 figures, 2 tables)

This paper contains 6 sections, 2 figures, 2 tables.

Introduction
Materials and methods
LR Inner Product Algorithm
L2R-CIPU Design
Result and Analysis
Conclusion

Figures (2)

Figure 1: LR Composite Inner Product Computation Unit [2023hardware]
Figure 2: Tiling and Processing Element of the L2R-CIPU Design
