L2R-CIPU: Efficient CNN Computation with Left-to-Right Composite Inner Product Units

Malik Zohaib Nisar, Mohammad Sohail Ibrahim, Muhammad Usman, Jeong-A Lee

TL;DR

This work tackles the throughput bottleneck of traditional right-to-left, bit-serial CNN accelerators by introducing a left-to-right composite inner-product unit (L2R-CIPU) that uses MSDF LR arithmetic and an online reduction tree. The core timing for inner products is modeled as $\delta_{IP} = n^{2} + \delta_{Mult}$ with overall cycle count $Cycle_{P}$ incorporating tile and kernel reductions, enabling efficient parallel accumulation via a 6:2 compressor and carry-save registers. Implemented as an 8×8 PE array processing 3×3 windows over 8 input channels, the design demonstrated substantial gains on VGG-16, achieving up to 6.22× higher performance and 15× higher energy efficiency, plus 53.45× higher TOPS/mm² area efficiency versus prior accelerators. These results suggest that LR-based inner-product computation can markedly improve hardware CNN throughput and energy efficiency for commodity deep learning workloads.

Abstract

This paper proposes a composite inner-product computation unit based on left-to-right (LR) arithmetic for accelerating convolutional neural networks (CNNs) in hardware. The efficacy of the proposed L2R-CIPU method is demonstrated on the VGG-16 network and assessed on several performance metrics. The L2R-CIPU design achieves 1.06× to 6.22× greater performance, 4.8× to 15× more TOPS/W, and 4.51× to 53.45× higher TOPS/mm² than prior architectures.
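
To make the idea of left-to-right inner-product computation concrete, the following Python sketch accumulates an inner product by consuming the weight bits most-significant-bit first. It is only an illustration under simplifying assumptions: operands are plain unsigned integers, the function name `lr_inner_product` and the `n_bits` parameter are hypothetical, and it does not model the signed-digit MSDF arithmetic, the online reduction tree, or the 6:2 compressor of the actual design.

```python
def lr_inner_product(xs, ws, n_bits=8):
    """Inner product sum(x * w), processing weight bits MSB-first.

    Assumption: every weight is an unsigned integer below 2**n_bits.
    """
    acc = 0
    # Walk the bit positions from most significant to least significant.
    for j in range(n_bits - 1, -1, -1):
        # Sum the inputs whose weight has bit j set (one bit-slice of the product).
        partial = sum(x for x, w in zip(xs, ws) if (w >> j) & 1)
        # Horner-style update: shift the running result left, then add the slice.
        acc = (acc << 1) + partial
    return acc


# A 3x3 window over 8 input channels gives 72 operand pairs.
inputs = [(3 * i + 1) % 256 for i in range(72)]
weights = [(7 * i + 5) % 256 for i in range(72)]
result = lr_inner_product(inputs, weights)
reference = sum(x * w for x, w in zip(inputs, weights))
```

Because the most significant slices are accumulated first, the leading part of the result is fixed early, which is the property that left-to-right schemes exploit; here `result` simply matches the conventional `reference` sum.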

Paper Structure (6 sections, 2 figures, 2 tables)

This paper contains 6 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: LR Composite Inner Product Computation Unit [2023hardware]
  • Figure 2: Tiling and Processing Element of the L2R-CIPU Design