HCiM: ADC-Less Hybrid Analog-Digital Compute in Memory Accelerator for Deep Learning Workloads

Shubham Negi, Utkarsh Saxena, Deepika Sharma, Kaushik Roy

TL;DR

HCiM addresses the ADC bottleneck in Compute-in-Memory by coupling PSQ-based scale factor quantization with an ADC-Less hybrid architecture. It keeps matrix-vector multiplication (MVM) in analog crossbars while a digital CiM (DCiM) array processes the scale factors through in-memory addition and subtraction, exploiting the sparsity of ternary partial sums to save energy. Quantization-aware training preserves accuracy even with binary or ternary partial sums, and HCiM achieves energy reductions of up to 28× and 12× over analog CiM baselines using 7-bit and 4-bit ADCs, respectively. The approach reduces hardware overhead without significant accuracy loss on CIFAR-10 and ImageNet workloads, validated through cycle-accurate simulations across several model and dataset setups.

Abstract

Analog Compute-in-Memory (CiM) accelerators are increasingly recognized for their efficiency in accelerating Deep Neural Networks (DNN). However, their dependence on Analog-to-Digital Converters (ADCs) for accumulating partial sums from crossbars leads to substantial power and area overhead. Moreover, the high area overhead of ADCs constrains the throughput due to the limited number of ADCs that can be integrated per crossbar. An approach to mitigate this issue involves the adoption of extreme low-precision quantization (binary or ternary) for partial sums. Training based on such an approach eliminates the need for ADCs. While this strategy effectively reduces ADC costs, it introduces the challenge of managing numerous floating-point scale factors, which are trainable parameters like DNN weights. These scale factors must be multiplied with the binary or ternary outputs at the columns of the crossbar to ensure system accuracy. To that effect, we propose an algorithm-hardware co-design approach, where DNNs are first trained with quantization-aware training. Subsequently, we introduce HCiM, an ADC-Less Hybrid Analog-Digital CiM accelerator. HCiM uses analog CiM crossbars for performing Matrix-Vector Multiplication operations coupled with a digital CiM array dedicated to processing scale factors. This digital CiM array can execute both addition and subtraction operations within the memory array, thus enhancing processing speed. Additionally, it exploits the inherent sparsity in ternary quantization to achieve further energy savings. Compared to an analog CiM baseline architecture using 7-bit and 4-bit ADCs, HCiM achieves energy reductions of up to 28× and 12×, respectively.
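The abstract's central observation is that multiplying a scale factor by a ternary output needs no multiplier: the result is the scale factor added, subtracted, or skipped. The following is a minimal Python sketch of that idea only, not the authors' implementation. The function names, the symmetric threshold, and the toy values are all assumptions made for illustration, and the sketch models one output accumulated across several crossbar columns rather than the actual DCiM array organization.

```python
# Illustrative sketch only: names, threshold, and values are hypothetical,
# chosen to show the add/subtract/skip idea, not taken from the paper.

def crossbar_partial_sums(weights, inputs):
    """Column-wise dot products of one crossbar: sum over rows of w[row][col] * x[row]."""
    n_cols = len(weights[0])
    return [sum(row[c] * x for row, x in zip(weights, inputs)) for c in range(n_cols)]

def ternarize(partial_sum, threshold):
    """Quantize a column partial sum to -1, 0, or +1 (assumed symmetric threshold)."""
    if partial_sum > threshold:
        return 1
    if partial_sum < -threshold:
        return -1
    return 0

def accumulate_scaled(ternary_outputs, scale_factors, accumulator=0.0):
    """Apply scale factors using only addition and subtraction.

    A +1 output adds the scale factor, a -1 output subtracts it, and a 0
    output is skipped entirely, which is where ternary sparsity saves work.
    Returns the new accumulator and the number of add/sub operations done.
    """
    ops = 0
    for t, s in zip(ternary_outputs, scale_factors):
        if t == 1:
            accumulator += s
            ops += 1
        elif t == -1:
            accumulator -= s
            ops += 1
    return accumulator, ops

# Toy usage: four ternary partial sums with their (hypothetical) scale factors.
result = accumulate_scaled([1, 0, -1, 1], [0.5, 0.25, 2.0, 1.0])
print(result)  # -> (-0.5, 3): three add/sub operations, one skipped zero
```

In this toy run one of the four partial sums is zero, so only three add/subtract operations are needed. The more zeros the ternary quantizer produces, the fewer scale-factor operations remain.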

Paper Structure (13 sections, 4 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: (a) ResNet-20 trained with standard training mapped to CiM hardware. (b) ResNet-20 trained with Partial-Sum Quantization (PSQ) training mapped to HCiM has $15\times$ and $11\times$ lower energy and latency. Area-normalized latency is reported, reflecting differences in baseline areas.
  • Figure 2: (a) Overview of PSQ training algorithm. (b) ResNet-20 accuracy with ADC precision. (c) Distribution of $p$ at the columns of the crossbar. Scale factor (SF) access energy compared to total off-chip data access energy. (d) Impact of reducing the number of scale factors on application accuracy.
  • Figure 3: (a) Hybrid Analog-Digital CiM macro. (b) Architecture of digital CiM array, incorporating column peripherals with a chain of 1-bit adder/subtractor at each column to enable full-adder/subtractor functionality. (c) Read Bit lines (RBL, RBLB) of scale factor and partial sum memories are connected to realize CiM operation. Write Bit line ($WBL_{sf}$) of scale factor memory helps to perform both read and write operations. (d) Detailed view of the column peripheral, illustrating the implementation of in-memory subtraction and addition operations. $CB_{out}$ represents the final carry/borrow output from the column peripheral.
  • Figure 4: Read-Compute-Store pipeline of DCiM array. $R_{ji}$ represents the addition or subtraction operation between rows $j$ and $i$ of the scale factor and partial sum memories.
  • Figure 5: (a) Energy to process all the columns of analog CiM crossbar with ternary quantization. (b) Accuracy vs EDAP comparison of HCiM with baselines on ImageNet dataset.
  • ...and 2 more figures