DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference

Lorenzo Sonnino, Shaswot Shresthamali, Yuan He, Masaaki Kondo

TL;DR

This work targets the data-movement bottleneck in DNN GEMMs by introducing a digital in-SRAM approximate multiplier and the DAISM accelerator. The multiplier uses bit-parallel full-line activation to perform in-memory multiplication as a wired-OR of partial products, avoiding complex adder trees, and can be augmented with pre-computed partial-sum values to recover accuracy. The DAISM architecture leverages this multiplier, exploring FP mantissa processing with bf16 and PC2/PC3 variants, including truncation modes to trade accuracy for energy and throughput. Across accuracy, energy, and architectural metrics, DAISM demonstrates substantially higher area efficiency than state-of-the-art SRAM-based PIM solutions, with competitive energy efficiency and robust performance when scaling banked SRAM configurations, making it practical for edge and near-edge DNN workloads.
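The wired-OR idea can be illustrated with a small integer model. This is a minimal sketch, not the authors' implementation. It assumes that shifted copies of the multiplicand sit in consecutive SRAM rows, that each set bit of the multiplier activates one row, and that reading several rows at once yields their bitwise OR. The function names `or_multiply` and `pc2_multiply` are hypothetical, and so is the way the pre-computed row is selected (it replaces the two largest partial products whenever both are active). The operands are the 4-bit values from Figure 1.

```python
def or_multiply(a: int, b: int, bits: int = 4) -> int:
    """Approximate a * b by OR-ing the partial products selected by b's bits."""
    acc = 0
    for i in range(bits):
        if (b >> i) & 1:      # bit i of the multiplier activates row i
            acc |= a << i     # row i is assumed to hold the multiplicand shifted by i
    return acc


def pc2_multiply(a: int, b: int, bits: int = 4) -> int:
    """Same model, with one extra row holding the exact sum of the two largest
    partial products (assumed selection rule: used when both top bits are set)."""
    top, second = bits - 1, bits - 2
    both_top = bool((b >> top) & 1) and bool((b >> second) & 1)
    acc = 0
    if both_top:
        acc |= (a << top) + (a << second)   # pre-computed exact partial sum
    for i in range(bits):
        if both_top and i >= second:
            continue                        # already covered by the pre-computed row
        if (b >> i) & 1:
            acc |= a << i
    return acc


a, b = 0b1011, 0b0101
print(or_multiply(a, b), a * b)                                     # 47 55
print(or_multiply(a, 0b1100), pc2_multiply(a, 0b1100), a * 0b1100)  # 124 132 132
```

In this model the OR never exceeds the exact product, because an OR of non-negative integers is at most their sum. The error comes from carries that the OR drops, which is why storing an exact sum for the largest partial products recovers accuracy where it matters most.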

Abstract

DNNs are widely used but face significant computational costs due to matrix multiplications, especially from data movement between the memory and processing units. One promising approach is therefore Processing-in-Memory (PIM), as it greatly reduces this overhead. However, most PIM solutions rely either on novel memory technologies that have yet to mature or on bit-serial computations that have significant performance overhead and scalability issues. Our work proposes an in-SRAM digital multiplier that uses conventional memory to perform bit-parallel computations by activating multiple wordlines. We then introduce DAISM, an architecture leveraging this multiplier, which achieves up to two orders of magnitude higher area efficiency than its SOTA counterparts, with competitive energy efficiency.

Paper Structure (23 sections, 1 equation, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Example of the proposed multiplier's concept for $a = 1011$ and $b = 0101$. An SRAM line is read if the corresponding bit of the multiplier is 1.
  • Figure 2: In PC2, the pre-computed sum of the two largest partial products (PPs) is stored.
  • Figure 3: 4-bank DAISM architecture. Inputs are fed one at a time to the SRAM from a register file through the address decoder. The dotted area represents unused SRAM space (not to scale).
  • Figure 4: Accuracy evaluation for larger CNN using bfloat16 truncated PC3 compared to an exact float32 baseline
  • Figure 5: Energy breakdown for all the proposed mantissa multipliers compared to a common baseline, for either a 32kB or an 8kB SRAM. No-tr represents the energy saved by truncation.
  • ...and 3 more figures