IMAGINE: An 8-to-1b 22nm FD-SOI Compute-In-Memory CNN Accelerator With an End-to-End Analog Charge-Based 0.15-8POPS/W Macro Featuring Distribution-Aware Data Reshaping

Adrian Kneip; Martin Lefebvre; Pol Maistriaux; David Bol

TL;DR

IMAGINE addresses the fixed-swing limitation of charge-based compute-in-memory (CIM) by combining a swing-adaptive dot-product (DP) operator with distribution-aware data reshaping based on analog batch-normalization (ABN), enabling end-to-end in-memory CNN computation at 1-to-8b precision. The architecture combines a 1152×256 DP array, multi-bit input-serial, weight-parallel accumulation, and a distribution-shaping ADC, and relies on CIM-aware CNN training to tolerate nonidealities. Measurements of the 22nm FD-SOI implementation, embedded in the CERBERUS microcontroller, show an 8b system-level energy efficiency of 40 TOPS/W and competitive MNIST and CIFAR-10 accuracy, with a macro density of 187 kB/mm^2. Compared with prior charge-based CIM designs, IMAGINE improves peak macro energy and area efficiency by 3-to-5x and is the first to provide linear in-memory rescaling, which broadens the range of edge-CNN workloads that CIM hardware can serve.

Abstract

Charge-domain compute-in-memory (CIM) SRAMs have recently become an enticing compromise between computing efficiency and accuracy to process sub-8b convolutional neural networks (CNNs) at the edge. Yet, they commonly make use of a fixed dot-product (DP) voltage swing, which leads to a loss in effective ADC bits due to data-dependent clipping or truncation effects that waste precious conversion energy and computing accuracy. To overcome this, we present IMAGINE, a workload-adaptive 1-to-8b CIM-CNN accelerator in 22nm FD-SOI. It introduces a 1152x256 end-to-end charge-based macro with a multi-bit DP based on an input-serial, weight-parallel accumulation that avoids power-hungry DACs. An adaptive swing is achieved by combining a channel-wise DP array split with a linear in-ADC implementation of analog batch-normalization (ABN), obtaining a distribution-aware data reshaping. Critical design constraints are relaxed by including the post-silicon equivalent noise within a CIM-aware CNN training framework. Measurement results showcase an 8b system-level energy efficiency of 40TOPS/W at 0.3/0.6V, with competitive accuracies on MNIST and CIFAR-10. Moreover, the peak energy and area efficiencies of the 187kB/mm2 macro respectively reach up to 0.15-8POPS/W and 2.6-154TOPS/mm2, scaling with the 8-to-1b computing precision. These results exceed previous charge-based designs by 3-to-5x while being the first work to provide linear in-memory rescaling.
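The input-serial, weight-parallel accumulation mentioned in the abstract can be illustrated with a purely digital emulation. The sketch below is not the macro's analog charge-based implementation; it only shows the arithmetic identity that lets a multi-bit dot product be built from 1b input bit-planes without a DAC. The function name `bit_serial_dot`, the unsigned 8b input format, and the signed 4b weight range are assumptions chosen for illustration.

```python
import random


def bit_serial_dot(inputs, weights, input_bits=8):
    """Dot product of unsigned multi-bit inputs with integer weights,
    consuming one input bit-plane per step (LSB first)."""
    acc = 0
    for b in range(input_bits):
        # Bit-plane b of every input: a 1b value per row, so no DAC is needed.
        plane = [(x >> b) & 1 for x in inputs]
        # All weights contribute in parallel to the partial sum of this plane.
        partial = sum(p * w for p, w in zip(plane, weights))
        # Scale the partial sum by the significance of bit b and accumulate.
        acc += partial * (1 << b)
    return acc


if __name__ == "__main__":
    random.seed(0)
    n_rows = 1152  # matches the row count of the macro; illustrative only
    xs = [random.randrange(256) for _ in range(n_rows)]    # assumed unsigned 8b inputs
    ws = [random.randrange(-8, 8) for _ in range(n_rows)]  # assumed signed 4b weights
    reference = sum(x * w for x, w in zip(xs, ws))
    print(bit_serial_dot(xs, ws) == reference)
```

Processing inputs one bit at a time trades latency (one step per input bit) for simpler input drivers, which is consistent with the abstract's motivation of avoiding power-hungry DACs.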

Paper Structure (14 sections, 10 equations, 23 figures, 1 table)

Introduction
Basics and Challenges of Charge-Based CIM
Proposed CIM-SRAM Architecture
Overall Macro Architecture
Swing-Adaptive Charge-based DP Operator
Multi-Bit Input-and-Weight Accumulation
Distribution-Shaping Charge-Injection ADC
Mismatch and Low-Frequency Noise Calibration
CIM-CNN Accelerator Dataflow
Measurement Results
CIM-SRAM Characterization
CIM-CNN Accelerator Breakdown
Comparison to the State of the Art
Conclusion

Figures (23)

Figure 1: a) Top-down overview of mapping edge AI applications onto compute-in-memory (CIM) hardware for high-efficiency edge CNN processing. b) Precision scope of existing CIM architectures and illustration of the challenges faced by charge-based ones.
Figure 2: a) Simplified view of charge-based CIM-SRAM architectures, which accumulate the local results of analog XNORs from each b) 10T1C bitcell by means of charge injection through their computing capacitance $C_c$ on the column-shared dot-product line (DPL).
Figure 3: a) Considering 8b ADCs, the narrow normal distribution of DPL voltages in charge-based CIM-SRAMs leads to multiple wasted precision bits during the conversion. Voltage swing reduction with $N_{on} < N_{rows}$ further reduces this effective number of ADC bits. Providing (i) channel-adaptive DPL range compensation and (ii) pre-ADC ABN rescaling helps solve these issues. b) Test error on the MNIST dataset for a 784-512-128-10 MLP with various ABN gain and ADC precisions. Providing a channel-adaptive swing on top of ABN rescaling trades accuracy recovery for $\gamma$ precision.
Figure 4: a) Block diagram of the CERBERUS micro-controller, zooming on b) the IMAGINE mixed-signal CIM-CNN accelerator.
Figure 5: a) Top-level view of the charge-domain 1152$\times$256 CIM-SRAM macro, highlighting blocks of main interest. b) Coarse architecture of the 64 DP-to-ADC analog cores, mapping up to 4b weights and carrying out charge-based operations on the same sub-divided DPL along the way. c) Qualitative depiction of the macro's main operations.
...and 18 more figures
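
The effect described in the caption of Figure 3a, where a narrow dot-product distribution wastes ADC codes and a pre-ADC gain recovers some of them, can be reproduced with a small numerical experiment. This is a sketch under stated assumptions, not the paper's ADC model: products are modeled as independent random ±1 values, the ADC is an ideal uniform quantizer with clipping, and the gain of 8 is a hypothetical value picked for illustration.

```python
import math
import random


def quantize(value, bits, full_scale):
    """Ideal uniform ADC over [-full_scale, +full_scale] with clipping."""
    levels = 1 << bits
    clipped = max(-full_scale, min(full_scale, value))
    code = int((clipped + full_scale) / (2 * full_scale) * levels)
    return min(code, levels - 1)


def codes_used(samples, bits, full_scale, gain=1.0):
    """Number of distinct ADC codes hit after applying a pre-ADC gain."""
    return len({quantize(gain * s, bits, full_scale) for s in samples})


if __name__ == "__main__":
    random.seed(0)
    n_rows = 1152  # full-scale swing corresponds to all rows agreeing
    # Assumption: each row contributes an independent random +1 or -1 product.
    samples = [sum(random.choice((-1, 1)) for _ in range(n_rows))
               for _ in range(500)]
    plain = codes_used(samples, bits=8, full_scale=n_rows)
    scaled = codes_used(samples, bits=8, full_scale=n_rows, gain=8.0)
    print(f"codes used without gain: {plain} (~{math.log2(plain):.1f} bits)")
    print(f"codes used with gain 8:  {scaled} (~{math.log2(scaled):.1f} bits)")
```

Because the sum of many random ±1 terms concentrates within a few standard deviations of zero, only a small fraction of the 256 codes is exercised without rescaling; applying a gain before conversion spreads the same data over more codes, at the cost of clipping any outliers.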
