Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

Akshat Ramachandran, Souvik Kundu, Arnab Raha, Shamik Kundu, Deepak K. Mathaikutty, Tushar Krishna

TL;DR

This work tackles the accuracy-efficiency trade-off in LLM pruning by introducing FLOW, a method that assigns N:M sparsity per layer based on the presence and distribution of outliers, rather than applying one fixed pattern to the whole model. To realize these patterns in hardware, it proposes FlexCiM, a digital compute-in-memory accelerator that partitions a DCiM macro into sub-macros and uses distribution and merging units to support diverse N:M patterns at low overhead. In the reported experiments, FLOW improves accuracy by up to 36% over existing alternatives, while FlexCiM delivers up to 1.75x lower inference latency and 1.5x lower energy consumption than existing sparse accelerators. Results cover both transformer-based and state-space foundation models. Code is available at https://github.com/FLOW-open-project/FLOW

Abstract

Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW
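To make the N:M terminology concrete, the following is a minimal sketch of per-layer N:M pruning, in which N values are kept out of every group of M consecutive weights. It uses plain magnitude ranking as the selection rule, which is an assumption for illustration and is not the FLOW method. The function names and the per-layer patterns are likewise hypothetical, since FLOW derives its N and M values from outlier statistics that are not reproduced here.

```python
def nm_prune_row(row, n, m):
    """Keep the n largest-magnitude values in each group of m consecutive weights."""
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= m")
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Rank positions by magnitude (ties broken by lower index) and keep the top n.
        order = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def prune_layers(layers, patterns):
    """Apply a (possibly different) N:M pattern to every weight row of each layer."""
    return {
        name: [nm_prune_row(row, *patterns[name]) for row in rows]
        for name, rows in layers.items()
    }


# Toy weights: one row per layer, eight weights per row.
layers = {
    "layer0": [[0.9, -0.1, 0.05, 0.3, -0.7, 0.2, 0.0, 0.4]],
    "layer1": [[0.2, -1.5, 0.1, 0.05, 0.6, -0.3, 2.0, 0.01]],
}
# Hypothetical per-layer choices, hard-coded here for illustration only.
patterns = {"layer0": (2, 4), "layer1": (4, 8)}

sparse = prune_layers(layers, patterns)
for name, rows in sparse.items():
    print(name, patterns[name], rows)
```

Both patterns remove half of the weights, but 4:8 places no constraint on where the four survivors fall within a group of eight. In `layer1`, three of the four kept weights sit in the second half of the row, a layout that 2:4 could not represent. This is the kind of representational freedom the abstract refers to.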

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Normalized energy comparison between several digital and DCiM accelerators and the proposed FlexCiM across different models for flexible N:M inference acceleration.
  • Figure 2: (a) Outlier distribution measured by pairwise $L_1$ distance between outliers (normalized to BERT-large) across different models. (b) Intra-model variations in outlier distribution across different layers of a LLaMA3-8B model.
  • Figure 3: Example of efficient N:M assignment based on outlier presence and distribution in different situations.
  • Figure 4: (a) FlexCiM overview with a partition size $P=4$; (b) Organization of a single column of a FlexCiM sub-macro; (c) Memory cell structure.
  • Figure 5: Example of FlexCiM running 1:4 and 4:8 sparsity patterns.
  • ...and 2 more figures