NL-DPE: An Analog In-memory Non-Linear Dot Product Engine for Efficient CNN and LLM Inference
Lei Zhao, Luca Buonanno, Archit Gajjar, John Moon, Aishwarya Natarajan, Sergey Serebryakov, Ron M. Roth, Xia Sheng, Youtao Zhang, Paolo Faraboschi, Jim Ignowski, Giacomo Pedretti
TL;DR
NL-DPE is an ADC-less analog in-memory computing engine that combines RRAM crossbars for vector-matrix multiplications with ACAM-based decision-tree units that compute non-linear and data-dependent operations in the analog domain. By transforming non-linear functions and data-dependent matrix multiplications into decision-tree/logarithm-exponential computations, NL-DPE eliminates the energy- and area-heavy ADCs, and it uses software-based Noise-Aware Fine-tuning (NAF) to cope with RRAM noise. The approach enables end-to-end inference for CNNs and large language models, delivering about 28× higher energy efficiency and a 249× speedup over a GPU baseline, and about 22× higher energy efficiency and a 245× speedup over prior IMC accelerators, while maintaining high accuracy. This work demonstrates the practicality of ADC-free analog IMC for modern AI workloads and provides a scalable design path for transformer-based inference that requires no in-device calibration.
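The logarithm-exponential transformation mentioned above rests on the identity a·b = exp(log a + log b) for positive a and b: a multiplication between two runtime values becomes two non-linear lookups, an addition, and one more non-linear lookup. The following is a minimal numerical sketch of that identity only, not of the NL-DPE circuit; the function names and the explicit sign handling are assumptions made for illustration.

```python
import math

def log_domain_product(a: float, b: float) -> float:
    """Multiply two reals using only log, add, and exp (sign handled separately)."""
    if a == 0.0 or b == 0.0:
        return 0.0  # log(0) is undefined, so zero is treated as a special case
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return sign * math.exp(math.log(abs(a)) + math.log(abs(b)))

def log_domain_dot(xs, ys):
    """Dot product of two data-dependent vectors built from log-domain products."""
    return sum(log_domain_product(x, y) for x, y in zip(xs, ys))

result = log_domain_dot([1.5, -2.0, 0.0], [4.0, 0.5, 3.0])
```

Here both operands are ordinary runtime values, which is the case a crossbar with pre-programmed weights cannot handle directly, for example the query-key products in attention.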
Abstract
Resistive Random Access Memory (RRAM) based in-memory computing (IMC) accelerators offer significant performance and energy advantages for deep neural networks (DNNs), but face three major limitations: (1) they support only static dot-product operations and cannot accelerate arbitrary non-linear functions or data-dependent multiplications essential to modern LLMs; (2) they demand large, power-hungry analog-to-digital converter (ADC) circuits; and (3) mapping model weights to device conductance introduces errors from cell nonidealities. These challenges hinder scalable and accurate IMC acceleration as models grow. We propose NL-DPE, a Non-Linear Dot Product Engine that overcomes these barriers. NL-DPE augments crosspoint arrays with RRAM-based Analog Content Addressable Memory (ACAM) to execute arbitrary non-linear functions and data-dependent matrix multiplications in the analog domain by transforming them into decision trees, fully eliminating ADCs. To address device noise, NL-DPE uses software-based Noise Aware Fine-tuning (NAF), requiring no in-device calibration. Experiments show that NL-DPE delivers 28× energy efficiency and 249× speedup over a GPU baseline, and 22× energy efficiency and 245× speedup over existing IMC accelerators, while maintaining high accuracy.
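One way to picture a non-linear function expressed as a decision tree is as a piecewise-constant lookup: each leaf covers an input interval and returns a stored output value. The sketch below approximates a sigmoid this way in plain Python. It is an illustration under stated assumptions, not the paper's mapping: the uniform intervals, the midpoint values, the row count, and the names `build_interval_table` and `lookup` are all hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def build_interval_table(fn, lo: float, hi: float, n_rows: int):
    """Return rows (lower, upper, value); each row covers one input interval."""
    width = (hi - lo) / n_rows
    edges = [lo + i * width for i in range(n_rows + 1)]
    # Adjacent rows share an edge, so the intervals tile [lo, hi) without gaps.
    return [(edges[i], edges[i + 1], fn((edges[i] + edges[i + 1]) / 2))
            for i in range(n_rows)]

def lookup(rows, x: float) -> float:
    """Return the value of the row whose interval contains x, clamping at the ends."""
    if x < rows[0][0]:
        return rows[0][2]
    for lower, upper, value in rows:
        if lower <= x < upper:
            return value
    return rows[-1][2]

table = build_interval_table(sigmoid, -8.0, 8.0, 64)
max_err = max(abs(lookup(table, k / 100) - sigmoid(k / 100))
              for k in range(-800, 801))
```

With 64 rows over [-8, 8), each interval is 0.25 wide. Because the sigmoid's slope never exceeds 0.25, the worst-case error inside the range stays below about 0.03; more rows, or non-uniform intervals concentrated where the function bends, would tighten it.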
