NL-DPE: An Analog In-memory Non-Linear Dot Product Engine for Efficient CNN and LLM Inference

Lei Zhao, Luca Buonanno, Archit Gajjar, John Moon, Aishwarya Natarajan, Sergey Serebryakov, Ron M. Roth, Xia Sheng, Youtao Zhang, Paolo Faraboschi, Jim Ignowski, Giacomo Pedretti

TL;DR

NL-DPE presents an ADC-less analog in-memory computing engine that combines RRAM crossbars for vector-matrix multiplications with ACAM-based decision-tree units to compute non-linear and data-dependent operations in the analog domain. By transforming non-linear functions and data-dependent matrix multiplications into decision-tree/logarithm-exponential computations, NL-DPE eliminates the energy- and area-heavy ADCs and employs Noise-Aware Fine-tuning (NAF) to robustly cope with RRAM noise. The approach enables end-to-end inference for CNNs and large language models, delivering about 28× energy efficiency and 249× speedup versus GPUs, and about 22× energy efficiency and 245× speedup versus prior IMC accelerators, while maintaining high accuracy. This work demonstrates the practicality of ADC-free analog IMC for modern AI workloads and provides a scalable design path for transformer-based inference with relatively low calibration overhead across multiple chips.

Abstract

Resistive Random Access Memory (RRAM)-based in-memory computing (IMC) accelerators offer significant performance and energy advantages for deep neural networks (DNNs), but face three major limitations: (1) they support only static dot-product operations and cannot accelerate the arbitrary non-linear functions or data-dependent multiplications essential to modern LLMs; (2) they demand large, power-hungry analog-to-digital converter (ADC) circuits; and (3) mapping model weights to device conductance introduces errors from cell nonidealities. These challenges hinder scalable and accurate IMC acceleration as models grow. We propose NL-DPE, a Non-Linear Dot Product Engine that overcomes these barriers. NL-DPE augments crosspoint arrays with RRAM-based Analog Content Addressable Memory (ACAM) to execute arbitrary non-linear functions and data-dependent matrix multiplications in the analog domain by transforming them into decision trees, fully eliminating ADCs. To address device noise, NL-DPE uses software-based Noise-Aware Fine-tuning (NAF), requiring no in-device calibration. Experiments show that NL-DPE delivers 28× higher energy efficiency and a 249× speedup over a GPU baseline, and 22× higher energy efficiency and a 245× speedup over existing IMC accelerators, while maintaining high accuracy.
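
The TL;DR describes turning data-dependent multiplications into logarithm-exponential computations. The arithmetic identity behind that idea can be sketched in a few lines of Python: a product of two runtime values becomes a sum of logarithms followed by an exponential, so the only two-operand step left is an addition. This is an illustrative sketch only. The function names (log_exp_multiply, log_exp_dot) and the separate sign and zero handling are assumptions made for this example, and nothing here models the ACAM or crossbar circuits themselves.

```python
import math

def log_exp_multiply(a: float, b: float) -> float:
    """Multiply two scalars using only log, add, and exp.

    Assumption for this sketch: signs and zeros are handled outside the
    log domain, since log is undefined for non-positive inputs.
    """
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = 1.0 if (a > 0) == (b > 0) else -1.0
    # a * b = sign * exp(log|a| + log|b|)
    return sign * math.exp(math.log(abs(a)) + math.log(abs(b)))

def log_exp_dot(xs, ys):
    """Dot product of two runtime vectors built from log_exp_multiply."""
    return sum(log_exp_multiply(x, y) for x, y in zip(xs, ys))

# Both operands are "data-dependent": neither is a pre-programmed weight.
result = log_exp_dot([1.5, -2.0, 0.0], [4.0, 0.5, 3.0])
print(result)
```

In this form the non-linear pieces (log and exp) are single-input functions, which is the kind of function the abstract says NL-DPE evaluates with decision trees, while the remaining accumulation is a plain sum.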

Paper Structure

This paper contains 35 sections, 9 equations, 16 figures, 6 tables, 1 algorithm.

Figures (16)

  • Figure 1: The energy breakdown of RRAM IMC accelerators, namely ISAAC (Shafiee et al., 2016; blue) and RAELLA (Andrulis et al., 2023; red). We focus on optimizing ADC and VFU, i.e., the bars with solid colors.
  • Figure 2: (a) Computing the VMM in a crossbar array with 1T1R RRAM cells. (b) Example of a trained decision tree. (c) ACAM cell using 1T1R. (d) Mapping the decision tree of (b) onto ACAM and performing inference.
  • Figure 3: (a) Attention mechanism and its mapping into conventional DPE architectures either by (b) reprogramming the DPE arrays or (c) using a VFU for DMMul.
  • Figure 4: (a) Conceptual representation of conventional DPE (left) and NL-DPE (right), which replaces ADC and digital computing logic. (b) Schematic of the accelerator tiled architecture. (c) Tile schematic. (d) Core schematic. (e) ACAM unit schematic.
  • Figure 5: (a) The Sigmoid function with output being quantized to 3 bits in unsigned binary ($y$ on the left Y axis) and in Gray code format ($g$ on the right Y axis). (b) Training dataset for predicting $y_1$ and $g_1$. (c) Trained DT to predict $y_1$. (d) Mapping (c) into an ACAM. (e) Trained DT to predict $g_1$. (f) Mapping of (e) into an ACAM.
  • ...and 11 more figures
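
The Figure 5 caption contrasts a 3-bit Sigmoid output in unsigned binary with the same output in Gray code, with a separate decision tree trained per output bit. A small Python sketch shows one way the two encodings differ: each output bit is a piecewise-constant function of the input, and a one-dimensional tree needs about one split threshold per transition of that bit. The quantization rule (floor to 8 levels), the input sweep range, and all names below are assumptions chosen for illustration. Whether transition count is the paper's motivation for using Gray code is also an assumption, as the caption does not say.

```python
import math

BITS = 3  # output resolution named in the Figure 5 caption

def quantized_sigmoid(x: float) -> int:
    """Sigmoid output quantized to an integer in 0..7 (floor rule, assumed)."""
    y = 1.0 / (1.0 + math.exp(-x))
    return min(int(y * (1 << BITS)), (1 << BITS) - 1)

def to_gray(q: int) -> int:
    """Standard binary-reflected Gray code of a non-negative integer."""
    return q ^ (q >> 1)

def bit_transitions(codes, bit: int) -> int:
    """Count how often one bit flips along an ordered sweep of codes."""
    bits = [(c >> bit) & 1 for c in codes]
    return sum(a != b for a, b in zip(bits, bits[1:]))

# Sweep the input over an assumed range wide enough to reach all 8 levels.
xs = [-6.0 + i / 100.0 for i in range(1201)]
binary_codes = [quantized_sigmoid(x) for x in xs]
gray_codes = [to_gray(q) for q in binary_codes]

# Index 0 is the least significant bit, index 2 the most significant.
binary_counts = [bit_transitions(binary_codes, b) for b in range(BITS)]
gray_counts = [bit_transitions(gray_codes, b) for b in range(BITS)]
print("binary:", binary_counts)  # binary: [7, 3, 1]
print("gray:  ", gray_counts)    # gray:   [4, 2, 1]
```

Under these assumptions the Gray-coded bits flip 7 times in total across the sweep versus 11 for unsigned binary, so per-bit trees for a monotonic function like Sigmoid would need fewer thresholds in the Gray encoding.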