Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models

J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain

TL;DR

The paper studies structured pruning of selective structured state space models (SSMs), focusing on Mamba and its hybrid architectures. It introduces Mamba-Shedder, a training-free pruning framework that removes Mamba blocks, SSM modules, Transformer subblocks, and channel groups to cut compute and memory while preserving accuracy. In experiments on several Mamba and hybrid models (e.g., Mamba-2.8B, Zamba2-2.7B, Hymba, Falcon-Mamba), the authors show that multi-granularity pruning yields inference speedups of up to about 1.4x, with accuracy losses that are modest or recoverable through recovery tuning. The work also offers practical guidance on which components drive efficiency versus accuracy in each architecture: targeted SSM pruning and MLP channel pruning can complement block pruning to reach favorable efficiency-accuracy trade-offs in deployment.

Abstract

Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.

Paper Structure

This paper contains 32 sections, 1 equation, 4 figures, 13 tables, 2 algorithms.

Figures (4)

  • Figure 1: Overview of Mamba-Shedder. This figure illustrates the pruning strategy for three types of Mamba-based models. The first type includes pure Mamba models such as Mamba-1 [mamba1], Mamba-2 [mamba2], and Falcon-Mamba [zuo2024falcon]. The second type comprises Mamba + Transformer architectures, including Zamba [glorioso2024zambacompact7bssm]. The third type is Hymba [dong2024hymba], a novel architecture with hybrid heads. Red dashed lines indicate components that may be removed. In Transformers, channel pruning (width pruning) can also be applied to the MLP block.
  • Figure 2: Pruning Mamba blocks. Avg. Accuracy is the average accuracy over seven tasks. The model composed of Mamba-1 blocks (left) can tolerate the removal of entire blocks without a significant increase in perplexity or drop in accuracy, in contrast to Mamba-2 and Zamba-2. In all three models, removing one Mamba block removes 0.04B parameters from the model. These are training-free results, and drops in accuracy can be reduced by a subsequent fine-tuning stage (§ 4.5).
  • Figure 3: Pruning SSM modules (S6 and SSD). Mamba-2.8B and Mamba2-2.7B have 64 SSM modules, while Zamba2-2.7B has 54 SSM (SSD) modules. Avg. Accuracy is the average over the seven evaluated tasks.
  • Figure 4: Close examination of the impact of removing Mamba blocks or SSMs from the two versions of Mamba models reveals distinct differences in their tolerance levels. Mamba-1 exhibits a higher tolerance for removing its blocks, while Mamba-2 exhibits greater tolerance for removing the SSM subcomponent.
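
The block-removal experiments summarized in the captions above can be pictured with a small toy sketch. It is an illustration under stated assumptions, not the authors' algorithm: the "model" is a stack of scalar residual blocks, the importance signal is the mean squared deviation from the unpruned outputs on a few calibration inputs (a stand-in for perplexity), and the names `run`, `deviation`, and `greedy_prune` are hypothetical. No weights are updated, which mirrors the training-free setting.

```python
from typing import Callable, List, Sequence

Block = Callable[[float], float]


def run(blocks: Sequence[Block], active: Sequence[bool], x: float) -> float:
    """Residual stack: every active block adds its output to the stream.
    A removed block is skipped, so the stream passes through unchanged."""
    for block, keep in zip(blocks, active):
        if keep:
            x = x + block(x)
    return x


def deviation(blocks: Sequence[Block], active: Sequence[bool],
              calib: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared gap to the unpruned outputs (assumed proxy for perplexity)."""
    outputs = [run(blocks, active, x) for x in calib]
    return sum((o - r) ** 2 for o, r in zip(outputs, reference)) / len(calib)


def greedy_prune(blocks: Sequence[Block], calib: Sequence[float],
                 n_remove: int) -> List[int]:
    """Drop blocks one at a time, each time removing the block whose
    absence changes the calibration outputs the least."""
    active = [True] * len(blocks)
    reference = [run(blocks, active, x) for x in calib]
    removed: List[int] = []
    for _ in range(min(n_remove, len(blocks))):
        best_idx, best_score = -1, float("inf")
        for i, keep in enumerate(active):
            if not keep:
                continue
            trial = list(active)
            trial[i] = False  # tentatively remove block i
            score = deviation(blocks, trial, calib, reference)
            if score < best_score:
                best_idx, best_score = i, score
        active[best_idx] = False
        removed.append(best_idx)
    return removed


# Toy blocks: two contribute strongly, two are nearly redundant.
blocks: List[Block] = [
    lambda x: 0.5 * x,
    lambda x: 0.001 * x,
    lambda x: -0.3 * x,
    lambda x: 0.002 * x,
]
calib = [1.0, 2.0, -1.5]
removed = greedy_prune(blocks, calib, n_remove=2)
print("removed block indices, in order:", removed)
```

In this toy setup the two nearly redundant blocks are removed first. In a real model, the score would come from a calibration set and a metric such as perplexity, and the same loop could in principle operate on finer units (for example, only the SSM module inside a block) instead of whole blocks.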