Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity

Cenlin Duan, Jianlei Yang, Yiou Wang, Yikun Wang, Yingjie Qi, Xiaolin He, Bonan Yan, Xueyan Wang, Xiaotao Jia, Weisheng Zhao

TL;DR

This paper tackles the inefficiency of traditional SRAM-PIM in exploiting unstructured bit-level sparsity due to rigid crossbar constraints. It introduces DB-PIM, an algorithm-architecture co-design that uses a Fixed Threshold Approximation (FTA) with a dyadic-block (DB) sparsity pattern, aided by Canonical Signed Digit (CSD) encoding and a specialized CSD-based adder tree. The hardware design includes Dyadic Block Multiply Units (DBMUs) and an Input Pre-processing Unit (IPU), enabling dynamic bypass of zero blocks and metadata-guided MAC operations. Through training-time FTA-aware quantization and offline compilation, the framework achieves up to $7.69\times$ speedup and up to $83.43\%$ energy savings, with strong area efficiency relative to prior SRAM-PIM approaches. This work demonstrates a practical path to significantly boost SRAM-PIM performance by leveraging unstructured bit-level sparsity in neural networks.
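
To make the idea of bypassing zero blocks concrete, here is a minimal Python sketch of a bit-serial dot product that skips all-zero input bit columns. It is an illustration under stated assumptions, not the DB-PIM hardware: the function name `bit_serial_dot`, the group size `block`, and the definition of a "zero block" as one bit position that is zero across a whole group of inputs are all hypothetical choices made for this example.

```python
def bit_serial_dot(inputs, weights, bits=8, block=4):
    """Dot product of unsigned `bits`-bit inputs with integer weights.

    Inputs are processed one bit position at a time within groups of `block`
    values. A bit position that is zero for every input in the group is
    skipped (assumed meaning of a "zero block" in this sketch).
    Returns (result, cycles_used, cycles_dense).
    """
    assert len(inputs) == len(weights)
    total = 0
    cycles_used = 0   # bit columns actually processed
    cycles_dense = 0  # bit columns a dense bit-serial design would process
    for start in range(0, len(inputs), block):
        xs = inputs[start:start + block]
        ws = weights[start:start + block]
        for b in range(bits):
            cycles_dense += 1
            column = [(x >> b) & 1 for x in xs]
            if not any(column):
                continue  # all-zero column contributes nothing: bypass it
            cycles_used += 1
            # Sum the weights selected by the 1-bits, then apply the bit weight.
            partial = sum(w for bit, w in zip(column, ws) if bit)
            total += partial * (1 << b)
    return total, cycles_used, cycles_dense


inputs = [0, 0, 0, 0, 1, 2, 3, 0]
weights = [5, -3, 2, 7, 1, -1, 4, 9]
result, used, dense = bit_serial_dot(inputs, weights)
print(result, used, dense)
```

In this toy case only 2 of the 16 bit columns carry any non-zero input bit, so the remaining 14 are skipped without changing the result.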

Abstract

Bit-level sparsity in neural network models holds substantial untapped potential: eliminating redundant computations on randomly distributed zero bits significantly boosts computational efficiency. Yet traditional digital SRAM-PIM architectures, constrained by their rigid crossbar structure, struggle to exploit this unstructured sparsity effectively. To address this challenge, we propose Dyadic Block PIM (DB-PIM), an algorithm-architecture co-design framework. Algorithmically, we propose a method coupled with a distinctive sparsity pattern, termed a dyadic block (DB), that preserves the random distribution of non-zero bits to maintain accuracy while restricting the number of these bits in each weight to improve regularity. Architecturally, we develop a custom PIM macro that includes dyadic block multiplication units (DBMUs) and Canonical Signed Digit (CSD)-based adder trees, specifically tailored for Multiply-Accumulate (MAC) operations. An input pre-processing unit (IPU) further improves performance and efficiency by capitalizing on block-wise input sparsity. Results show that the proposed co-design framework achieves a speedup of up to $7.69\times$ and energy savings of up to $83.43\%$.
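
As a rough illustration of why CSD encoding helps when the number of non-zero bits per weight is restricted, the Python sketch below converts an integer to its canonical signed-digit form (digits in {-1, 0, +1}, no two adjacent non-zero digits) and then caps the number of non-zero digits. The CSD conversion is the standard non-adjacent-form algorithm. The capping step, `truncate_nonzero_digits`, is a hypothetical stand-in that simply keeps the most significant non-zero digits; it is not the paper's FTA algorithm, whose details are not given here.

```python
def to_csd(value: int) -> list[int]:
    """Return the CSD digits of `value`, least significant digit first."""
    digits = []
    n = value
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1, n mod 4 == 3 -> digit -1
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def from_csd(digits: list[int]) -> int:
    """Reconstruct the integer from CSD digits (least significant first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def truncate_nonzero_digits(digits: list[int], max_nonzero: int) -> list[int]:
    """Hypothetical approximation: keep only the `max_nonzero` most
    significant non-zero digits and zero out the rest."""
    kept = 0
    out = [0] * len(digits)
    for i in reversed(range(len(digits))):
        if digits[i] != 0 and kept < max_nonzero:
            out[i] = digits[i]
            kept += 1
    return out


w = 119                      # binary 01110111 has six 1-bits
csd = to_csd(w)              # 128 - 8 - 1 needs only three non-zero digits
approx = from_csd(truncate_nonzero_digits(csd, 2))
print(csd, approx)
```

Here the weight 119 needs six non-zero bits in plain binary but only three signed digits in CSD, and capping it at two non-zero digits yields 120.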

Paper Structure (15 sections, 2 equations, 7 figures, 4 tables, 1 algorithm)

This paper contains 15 sections, 2 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Exploitation of bit-level sparsity in SRAM-PIMs.
  • Figure 2: Bit-level sparsity in weights ($W$) and input features ($I$) across different models.
  • Figure 3: Overview of the proposed DB-PIM, an algorithm and architecture co-design framework.
  • Figure 4: Extraction and representation of bit-level sparsity patterns.
  • Figure 5: Circuit design of customized SRAM-PIM macro.
  • ...and 2 more figures