Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
TL;DR
The paper tackles the efficiency bottlenecks of sequence models by studying structured pruning in selective structured state space models (SSMs), focusing on Mamba and hybrid architectures. It introduces Mamba-Shedder, a training-free pruning framework that targets Mamba blocks, SSM modules, Transformer subblocks, and channel groups to reduce compute and memory while preserving accuracy. Through extensive experiments on multiple Mamba and hybrid models (e.g., Mamba-2.8B, Zamba-2-2.7B, Hymba, Falcon-Mamba), the authors show that multi-granularity pruning can yield inference speedups of up to ~1.4x with modest accuracy losses, which recovery tuning can further reduce. The work offers practical guidance on which components drive efficiency versus accuracy across architectures, highlighting that targeted SSM pruning and MLP channel pruning can complement block pruning to achieve favorable efficiency-accuracy trade-offs in real-world deployment.
Abstract
Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
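To make the idea of a removal-sensitivity study concrete, the following is a minimal sketch of greedy, training-free block removal on a toy residual stack. It is not the Mamba-Shedder implementation: the toy blocks, the mean-squared-deviation score, and all names (make_block, run, deviation, greedy_prune) are assumptions chosen for illustration. Each step tries removing every remaining block, measures how far the outputs drift from the unpruned model on a few calibration inputs, and drops the block whose removal hurts least.

```python
import math

def make_block(scale, w):
    """Hypothetical toy residual block: x -> x + scale * tanh(w * x)."""
    return lambda x: x + scale * math.tanh(w * x)

def run(blocks, active, x):
    """Apply only blocks whose index is in `active`; removed blocks act as identity."""
    for i, block in enumerate(blocks):
        if i in active:
            x = block(x)
    return x

def deviation(blocks, active, reference, calib):
    """Mean squared difference between pruned outputs and full-model outputs."""
    total = sum((run(blocks, active, x) - r) ** 2 for x, r in zip(calib, reference))
    return total / len(calib)

def greedy_prune(blocks, calib, n_remove):
    """Remove `n_remove` blocks one at a time, always the least harmful one."""
    active = set(range(len(blocks)))
    reference = [run(blocks, active, x) for x in calib]  # unpruned outputs
    removed = []
    for _ in range(n_remove):
        # Score each candidate by the output drift its removal would cause.
        scores = {i: deviation(blocks, active - {i}, reference, calib) for i in active}
        victim = min(scores, key=scores.get)
        active.remove(victim)
        removed.append(victim)
    return removed, active

# Four toy blocks; the one with scale 0.0 contributes nothing to the output.
blocks = [make_block(s, 0.5) for s in (1.0, 0.0, 0.5, 0.001)]
calib = [0.5, 1.0, -1.5, 2.0]
removed, active = greedy_prune(blocks, calib, n_remove=2)
print("removed order:", removed, "kept:", sorted(active))
```

In this toy setting the zero-scale block is removed first because dropping it leaves the outputs unchanged. With a real model, the score would typically be a task metric or loss on calibration data, and the removable units could be whole blocks, SSM modules, or channel groups, as the abstract describes.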
