A 33.6-136.2 TOPS/W Nonlinear Analog Computing-In-Memory Macro for Multi-bit LSTM Accelerator in 65 nm CMOS

Junyi Yang, Xinyu Luo, Ye Ke, Zheng Wang, Hongyang Shang, Shuai Dong, Zhengnan Fu, Xiaofeng Yang, Hongjie Liu, Arindam Basu

TL;DR

This work tackles the energy bottleneck in ACIM-based LSTM accelerators by moving nonlinear activation computation into the analog domain through a reconfigurable nonlinear in-memory ADC (NLIM ADC). It introduces an ACIM macro with a dual-9T SRAM bitcell for signed inputs and ternary weights, a read-word-line underdrive Cascode (RUDC) to boost dynamic range and linearity, and a dual-supply 6T-SRAM scheme for multi-bit weights, coupled with an NLIM ADC that achieves an average error below 1 LSB when approximating NL activations. Combined measurements and simulations show 92.0% on-chip inference accuracy on a 12-class keyword-spotting task, with 2.2× higher system-level normalized energy efficiency and 1.6× better normalized area efficiency than state-of-the-art works; simulation indicates that the replica bias strategy keeps the NLIM ADC robust against temperature variations. Computing activations in the analog domain reduces the share of NL operations executed digitally, which targets energy-efficient LSTM inference at the edge.

Abstract

The energy efficiency of analog computing-in-memory (ACIM) accelerators for recurrent neural networks, particularly long short-term memory (LSTM) networks, is limited by the high proportion of nonlinear (NL) operations that are typically executed digitally. To address this, we propose an LSTM accelerator incorporating an ACIM macro with a reconfigurable (1-5 bit) nonlinear in-memory (NLIM) analog-to-digital converter (ADC) that computes NL activations directly in the analog domain using: 1) a dual 9T bitcell with decoupled read/write paths for signed inputs and ternary weight operations; 2) a read-word-line underdrive Cascode (RUDC) technique achieving 2.8X higher read-bitline dynamic range than single-transistor designs (1.4X higher than a conventional Cascode structure, with 7X lower current variation); 3) a dual-supply 6T-SRAM array for efficient multi-bit weight operations that reduces both bitcell count (7.8X) and latency (4X) for 5-bit weight operations. We experimentally demonstrate a 5-bit NLIM ADC for approximating NL activations in LSTM cells, achieving an average error below 1 LSB. Simulation confirms the robustness of the NLIM ADC against temperature variations thanks to the replica bias strategy. Our design achieves 92.0% on-chip inference accuracy on a 12-class keyword-spotting task while demonstrating 2.2X higher system-level normalized energy efficiency and 1.6X better normalized area efficiency than state-of-the-art works. The results combine physical measurements of a macro unit, which accounts for the majority of LSTM operations (99% of linear and 80% of nonlinear operations), with simulations of the remaining components, including additional LSTM and fully connected layers.
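
The normalized efficiency figures quoted above follow the convention stated in the Figure 1 caption (Normalized EE = EE × input precision × weight precision). A minimal Python sketch of that normalization is given here; the function name and the sample numbers are illustrative assumptions and are not values reported in the paper.

```python
def normalized_ee(ee_tops_per_w: float, input_bits: int, weight_bits: int) -> float:
    """Scale an energy efficiency (TOPS/W) to a 1-bit-input, 1-bit-weight basis.

    Implements: Normalized EE = EE * input precision * weight precision.
    """
    if input_bits < 1 or weight_bits < 1:
        raise ValueError("precisions must be at least 1 bit")
    return ee_tops_per_w * input_bits * weight_bits


# Hypothetical accelerator: 2.0 TOPS/W measured at 4-bit inputs and 5-bit weights.
example = normalized_ee(2.0, input_bits=4, weight_bits=5)  # 2.0 * 4 * 5 = 40.0
```

Normalizing this way lets designs that operate at different input and weight precisions be compared on a common basis.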

Paper Structure

This paper contains 18 sections, 1 equation, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: Limitations and solutions of current CIM for RNNs: (a) A survey of RNN accelerators and other CIM-based DNN accelerators (all energy efficiencies are normalized to 1-bit input and 1-bit weight according to the formula from song20234: Normalized EE = EE × input precision × weight precision). (b) One LSTM cell with a large number of nonlinear activations. (c) Energy efficiency of various sub-parts of the LSTM accelerator in previous work (Nature'23, ambrogio2023analog). (d) Architecture comparison of our proposed method with the conventional method for LSTM accelerators.
  • Figure 2: Hardware block diagram: (a) Macro architecture in the test chip (BUF: buffer). (b) Circuit timing diagram of one column (PCH: precharge). (c) Voltage waveform diagrams of RBL and outputs of SA based on post-layout simulation. (d) In-memory ternary multiplication of the proposed dual 9T SRAM bitcell.
  • Figure 3: (a) RUDC for enhanced linearity and DR. (b) Implementation of multi-bit weight using dual-supply for 6T-SRAM array. (c) Three methods (proposed, TCASI'25 dong2025topkima, JSSC'22 yu202265) for implementing 5-bit signed weight. (d) Comparison of input latency and cell number for implementing multi-bit weight.
  • Figure 4: Monte Carlo simulations for the ratio $n_{BWR}=2$ under the conditions of $V_{MSB}$ = 0.45 V and $V_{LSB}$ = 0.42 V.
  • Figure 5: (a) Traditional single-slope ADC. (b) Our Ramp NL ADC. (c) The inverse of the sigmoid function. (d) The size of each step of the ramp voltage $V_k$, denoted by $\Delta V_k$. (e) Integer-quantized $\Delta V_k$ for implementation in hardware.
  • ...and 7 more figures
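
The idea summarized in the Figure 5 caption can be illustrated in software: if the ramp of a single-slope ADC follows the inverse of the sigmoid instead of a straight line, the step count at which the ramp crosses the input approximates the sigmoid of that input. The sketch below is an idealized behavioral model under stated assumptions (dimensionless inputs, a noiseless comparator, ramp levels placed at logit(k/2^N)). All function names are hypothetical, and the code does not reproduce the authors' circuit or their exact quantization of $\Delta V_k$.

```python
import math


def inverse_sigmoid(y: float) -> float:
    """Logit function: the inverse of sigmoid(x) = 1 / (1 + exp(-x))."""
    return math.log(y / (1.0 - y))


def ramp_levels(bits: int) -> list[float]:
    """Ramp levels V_k, k = 1 .. 2^bits - 1 (assumed placement at logit(k / 2^bits))."""
    n = 2 ** bits
    return [inverse_sigmoid(k / n) for k in range(1, n)]


def ramp_steps(levels: list[float]) -> list[float]:
    """Step sizes Delta V_k between consecutive ramp levels."""
    return [b - a for a, b in zip(levels, levels[1:])]


def quantize_steps(steps: list[float], unit: float) -> list[int]:
    """Round each step to an integer multiple of `unit` (assumed hardware unit step)."""
    return [max(1, round(s / unit)) for s in steps]


def nl_adc(x: float, levels: list[float]) -> int:
    """Output code: the number of ramp levels the input has reached."""
    return sum(1 for v in levels if x >= v)


levels = ramp_levels(5)                    # 31 levels for a 5-bit conversion
code = nl_adc(0.0, levels)                 # 16, i.e. 32 * sigmoid(0)
steps = ramp_steps(levels)                 # smallest near the center, largest at the tails
int_steps = quantize_steps(steps, unit=min(steps))
```

In this ideal model the code equals the floor of 2^N · sigmoid(x), so the only error is quantization. A real implementation also has to absorb the rounding of the steps to integers, device mismatch, and temperature drift.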