DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference

Lorenzo Sonnino; Shaswot Shresthamali; Yuan He; Masaaki Kondo

TL;DR

This work targets the data-movement bottleneck in DNN matrix multiplications (GEMMs) by introducing a digital, approximate in-SRAM multiplier and the DAISM accelerator built on it. The multiplier activates multiple SRAM wordlines at once, so a single bit-parallel read returns the wired-OR of the selected partial products and no complex adder tree is needed. Storing pre-computed partial sums alongside the partial products recovers part of the lost accuracy. DAISM applies this multiplier to floating-point mantissa multiplication in bfloat16, with PC2 and PC3 variants and truncation modes that trade accuracy for energy and throughput. In the evaluation, DAISM reaches up to two orders of magnitude higher area efficiency than state-of-the-art SRAM-based PIM designs, with competitive energy efficiency and robust performance as the number of SRAM banks scales, which makes it practical for edge and near-edge DNN workloads.
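The wired-OR idea can be illustrated in software. The sketch below is a minimal functional model, not the authors' circuit: it assumes that SRAM line `i` holds the multiplicand shifted left by `i` bits and that reading several lines at once yields their bitwise OR. The function name `or_multiply` and the `width` parameter are hypothetical, chosen for illustration; the operands are the ones used in the Figure 1 caption.

```python
def or_multiply(multiplicand: int, multiplier: int, width: int = 4) -> int:
    """Approximate product: OR (instead of add) the shifted copies of
    `multiplicand` selected by the 1-bits of `multiplier`.

    Assumption: line i of the modelled SRAM stores multiplicand << i.
    """
    result = 0
    for i in range(width):
        if (multiplier >> i) & 1:        # bit i set -> line i is read
            result |= multiplicand << i  # simultaneous reads merge as OR
    return result


a, b = 0b1011, 0b0101          # operands from the Figure 1 caption
approx = or_multiply(a, b)     # 0b001011 | 0b101100
exact = a * b
print(f"approx={approx} exact={exact}")  # approx=47 exact=55
```

Because OR never generates carries, the modelled result can only be less than or equal to the exact product, and it is exact whenever the selected partial products have no overlapping 1-bits.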

Abstract

DNNs are widely used but incur significant computational costs from matrix multiplications, especially from data movement between memory and processing units. Processing-in-Memory (PIM) is therefore a promising approach, as it greatly reduces this overhead. However, most PIM solutions rely either on novel memory technologies that have yet to mature or on bit-serial computation, which carries significant performance overhead and scalability issues. Our work proposes an in-SRAM digital multiplier that uses conventional memory to perform bit-parallel computation by activating multiple wordlines simultaneously. We then introduce DAISM, an architecture built on this multiplier, which achieves up to two orders of magnitude higher area efficiency than state-of-the-art counterparts, with competitive energy efficiency.

Paper Structure (23 sections, 1 equation, 8 figures, 3 tables)

Introduction
Background and related works
Binary multiplication
Related works
Proposed multipliers
Core concept
Storing pre-computed values
Floating point generalization
Accelerator architecture
Core architecture
Architecture variations
Evaluation
Accuracy
Methodology
Results
...and 8 more sections

Figures (8)

Figure 1: Example of the proposed multiplier's concept for $a = 1011$ and $b = 0101$. An SRAM line is read if the corresponding bit of the multiplier is 1.
Figure 2: In PC2, the pre-computed sum of the two largest partial products (PPs) is stored.
Figure 3: 4-bank DAISM architecture. Inputs are fed one at a time to the SRAM from a register file through the address decoder. The dotted area represents unused SRAM space (not to scale).
Figure 4: Accuracy evaluation for larger CNNs using bfloat16 truncated PC3, compared to an exact float32 baseline.
Figure 5: Energy breakdown for all the proposed mantissa multipliers, compared to a common baseline, for either a 32kB or an 8kB SRAM. No-tr represents the energy consumption spared by truncation.
...and 3 more figures
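The captions of Figures 1 and 2 can be read together as a small worked example. The sketch below is one assumed interpretation of PC2, not the paper's specification: when the two most significant multiplier bits are both 1, a hypothetical extra SRAM line holding the exact sum of the two largest partial products is read in place of those two lines, and the remaining lines are still merged by OR. The names `or_multiply`, `pc2_multiply`, and `width` are illustrative only.

```python
def or_multiply(multiplicand: int, multiplier: int, width: int = 4) -> int:
    """Baseline model: OR of the partial products selected by `multiplier`."""
    result = 0
    for i in range(width):
        if (multiplier >> i) & 1:
            result |= multiplicand << i
    return result


def pc2_multiply(multiplicand: int, multiplier: int, width: int = 4) -> int:
    """Assumed PC2 model: the two largest partial products are added exactly
    (via a pre-computed line) when both are selected; the rest are OR-ed."""
    top, second = width - 1, width - 2
    # Hypothetical pre-computed line: exact sum of the two largest PPs.
    precomputed = (multiplicand << top) + (multiplicand << second)
    both_top = bool((multiplier >> top) & 1) and bool((multiplier >> second) & 1)

    result = precomputed if both_top else 0
    # If the pre-computed line was used, only the lower lines remain to be read.
    remaining = second if both_top else width
    for i in range(remaining):
        if (multiplier >> i) & 1:
            result |= multiplicand << i
    return result


a, b = 0b1011, 0b1101  # both top multiplier bits set, so PC2 applies
print(or_multiply(a, b), pc2_multiply(a, b), a * b)  # 127 143 143
```

For the Figure 1 operands ($b = 0101$) the top multiplier bit is 0, so under this model PC2 falls back to the plain OR result.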
