Full-Stack Optimization for CAM-Only DNN Inference

João Paulo C. de Lima; Asif Ali Khan; Luigi Carro; Jeronimo Castrillon

TL;DR

The paper tackles the high energy and latency costs of neural network inference on von Neumann systems by co-designing algorithms and hardware. It combines ternary weight networks (TWNs) with associative processors (APs) built on racetrack memory (RTM), enabling bulk-bitwise, in-memory convolutions while minimizing data transfers. A compilation flow that applies constant folding of the ternary weights, loop interchange, unrolling, and common subexpression elimination optimizes TWNs for AP execution and maps the computation onto the RTM-based CAM architecture. Results on ResNet-18/ImageNet and VGG networks show up to 7.5x better energy efficiency than crossbar-based in-memory accelerators while retaining software accuracy, supporting RTM-APs as a practical option for scalable DNN inference in memory-centric systems.

Abstract

The accuracy of neural networks has greatly improved across various domains over the past years. Their ever-increasing complexity, however, leads to prohibitively high energy demands and latency in von Neumann systems. Several computing-in-memory (CIM) systems have recently been proposed to overcome this, but trade-offs involving accuracy, hardware reliability, and scalability for large models remain a challenge. Additionally, for some CIM designs, the activation movement still requires considerable time and energy. This paper explores the combination of algorithmic optimizations for ternary weight neural networks and associative processors (APs) implemented using racetrack memory (RTM). We propose a novel compilation flow to optimize convolutions on APs by reducing their arithmetic intensity. By leveraging the benefits of RTM-based APs, this approach substantially reduces data transfers within the memory while addressing accuracy, energy efficiency, and reliability concerns. Concretely, our solution improves the energy efficiency of ResNet-18 inference on ImageNet by 7.5x compared to crossbar in-memory accelerators while retaining software accuracy.
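To make the idea of reducing arithmetic intensity concrete, the following is a minimal sketch of what constant folding of ternary weights and common subexpression elimination can look like in plain Python. It is an illustration under stated assumptions, not the paper's compiler: the function names (`fold_ternary`, `most_shared_pair`, `eval_with_cse`) and the greedy single-pair sharing heuristic are hypothetical. With weights restricted to {-1, 0, +1}, every multiplication disappears, zero weights drop out, and a sum that several filters need can be computed once.

```python
from collections import Counter
from itertools import combinations


def naive_dot(weights, x):
    """Reference: one multiply-accumulate per weight."""
    return sum(w * xi for w, xi in zip(weights, x))


def fold_ternary(weights):
    """Constant-fold a ternary filter into index sets to add and to subtract.

    Zero weights are dropped entirely (hypothetical helper, for illustration).
    """
    plus = frozenset(i for i, w in enumerate(weights) if w == 1)
    minus = frozenset(i for i, w in enumerate(weights) if w == -1)
    return plus, minus


def most_shared_pair(folded_filters):
    """Find the pair of added inputs that occurs in the most filters."""
    counts = Counter()
    for plus, _ in folded_filters:
        counts.update(combinations(sorted(plus), 2))
    if not counts:
        return None
    pair, occurrences = counts.most_common(1)[0]
    return pair if occurrences > 1 else None


def eval_with_cse(folded_filters, x):
    """Evaluate all filters, computing the most shared pair sum only once."""
    pair = most_shared_pair(folded_filters)
    shared = x[pair[0]] + x[pair[1]] if pair else 0
    outputs = []
    for plus, minus in folded_filters:
        if pair and set(pair) <= plus:
            acc = shared + sum(x[i] for i in plus - set(pair))
        else:
            acc = sum(x[i] for i in plus)
        outputs.append(acc - sum(x[i] for i in minus))
    return outputs


filters = [[1, 1, 0, -1], [1, 1, -1, 0], [0, 1, 1, 1]]
x = [3, 5, 2, 7]
folded = [fold_ternary(f) for f in filters]
assert eval_with_cse(folded, x) == [naive_dot(f, x) for f in filters]
```

In this toy case the sum `x[0] + x[1]` is needed by two filters and is computed once. A real flow would share many subexpressions across a whole layer rather than a single pair.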

Paper Structure (15 sections, 1 equation, 4 figures, 2 tables)

Introduction
Background
DNNs: Quantization and sparsity
Content addressable memories and associative processing
Racetrack memory
RTM-AP: Accelerator architecture
Compilation framework for RTM-APs
Data-flow graph generation
Input mapping and DFG scheduling
Lookup table generation
Experimental setup and evaluation results
Results summary and comparison to state-of-the-art
Impact on performance and energy consumption
Impact on data movement and write endurance
Conclusion

Figures (4)

Figure 1: Direct convolution and im2col transformation
Figure 2: RTM-AP. a) Hierarchical accelerator architecture consisting of banks, tiles, APs, buffers and interconnection network, b) tile showing array of APs, c) AP organization consisting of CAM array, registers, instruction cache and control unit, d) SIMD slot for two-operand addition and mapping of inputs to racetrack domains, e) an RTM nanowire.
Figure 3: a) Compilation flow and optimization techniques used in each step, b) naïve loop in convolutional layers, c) loop after applying loop interchange, unrolling and constant folding of ternary weights, d) loop after loop fission and common subexpression elimination, e) optimized data-flow graph (DFG) for Equation 1.
Figure 4: Layer-by-layer comparison with DNN+NeuroSim (Peng et al., 2019)
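The two-operand addition mentioned in the Figure 2 caption, and the lookup tables named in the section outline, can be pictured with a small software model of an associative processor. The sketch below is an assumption-laden toy: the pass table `PASSES` is a textbook-style in-place adder (B becomes A + B), not necessarily the lookup tables the paper's compiler generates, and `ap_add` is a hypothetical name. Each pass searches every row for one bit pattern and then writes new bits into the rows that matched, so all rows are processed together, one bit position at a time.

```python
# Each pass: search pattern (a, b, carry) -> bits to write (new b, new carry).
# Only the four input patterns whose output differs from the input need a pass.
# The order is chosen so a row rewritten by one pass never matches a later one.
PASSES = [
    ((1, 1, 0), (0, 1)),
    ((1, 0, 0), (1, 0)),
    ((0, 0, 1), (1, 0)),
    ((0, 1, 1), (0, 1)),
]


def to_bits(value, width):
    """Little-endian bit list of a non-negative integer."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    return sum(bit << k for k, bit in enumerate(bits))


def ap_add(a_values, b_values, width):
    """Row-parallel, bit-serial addition modelled as search/write passes."""
    rows = [
        {"a": to_bits(a, width), "b": to_bits(b, width), "c": 0}
        for a, b in zip(a_values, b_values)
    ]
    for k in range(width):  # bit-serial over operand width
        for pattern, (new_b, new_c) in PASSES:
            # Search phase: tag every row whose (a, b, carry) matches.
            tagged = [r for r in rows if (r["a"][k], r["b"][k], r["c"]) == pattern]
            # Write phase: update only the tagged rows.
            for r in tagged:
                r["b"][k] = new_b
                r["c"] = new_c
    # The final carry acts as the extra most significant bit of each sum.
    return [from_bits(r["b"]) + (r["c"] << width) for r in rows]


assert ap_add([3, 5, 15, 0], [4, 7, 1, 0], width=4) == [7, 12, 16, 0]
```

The number of passes depends on the operand width and the table size, not on the number of rows, which is why this style of computation suits bulk additions over many activations at once.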
