On Pruning State-Space LLMs

Tamer Ghattas, Michael Hassid, Roy Schwartz

TL;DR

This paper evaluates pruning for state-space model (SSM)–based LLMs, adapting both unstructured (WANDA) and structured pruning methods to the SSM components across four models and multiple tasks. It finds that unstructured pruning is generally robust, and state pruning can incur only small degradations in several cases, while head pruning severely degrades performance across models. The work highlights that output projections are particularly sensitive to pruning and that the choice of pruning method critically shapes efficiency and accuracy outcomes. Collectively, the results demonstrate a path toward practical, more efficient SSM-based LLMs, while underscoring the need for method-model alignment and further exploration of all components.

Abstract

Recent work proposed state-space models (SSMs) as an efficient alternative to transformer-based LLMs. Can these models be pruned to further reduce their computation costs? We adapt several pruning methods to the SSM structure, and apply them to four SSM-based LLMs across multiple tasks. We find that such models are quite robust to some pruning methods (e.g., WANDA), while using other methods leads to fast performance degradation.
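
For readers unfamiliar with WANDA, the following is a minimal sketch of its generic pruning criterion on a single linear layer: each weight is scored by its magnitude times the L2 norm of the matching input feature over a calibration set, and the lowest-scoring weights in each output row are zeroed. The function name `wanda_prune` and the toy matrices are assumptions made for illustration. This sketch does not reproduce the authors' adaptation to the SSM components.

```python
import math

def wanda_prune(weight, activations, sparsity):
    """Zero the lowest-scoring weights in every output row.

    weight:      list of rows, shape (out_features, in_features)
    activations: calibration inputs, shape (n_samples, in_features)
    sparsity:    fraction of weights to remove per output row
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for row in weight:
        # WANDA score: |weight| times the norm of the matching input feature.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Indices of the n_prune lowest scores within this output row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Toy example (hypothetical values): 2 output rows, 4 input features.
W = [[0.5, -2.0, 0.1, 1.0],
     [1.5, 0.2, -0.3, 0.05]]
X = [[1.0, 0.1, 4.0, 1.0],
     [1.0, 0.1, 3.0, 1.0]]
pruned = wanda_prune(W, X, 0.5)
```

In this toy case the largest-magnitude weight in the first row (-2.0) is removed because its input feature is almost always near zero, which is the behavior that separates this criterion from plain magnitude pruning.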

Paper Structure

This paper contains 28 sections, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Pruning SSM-based LLMs. Right: the Mamba SSM block: the input is linearly projected using five projection matrices ($W_Z$, $W_X$, $W_B$, $W_C$, $W_{\Delta}$), to be used in later parts of the block. Every SSM head is represented using two vectors (two rows for InProj and two columns for OutProj). Left: our different structured pruning methods. Each yellow cell represents a pruned element in the corresponding head. (1) State pruning: head extraction from the $W_B$ and $W_C$ tensors, then pruning the corresponding conv1d filters; (2) Head dimension pruning: head extraction from $W_X$, $W_Z$, $W_A$, $W_D$ and $W_{\Delta}$, and pruning the corresponding conv1d filters and OutProj rows; (3) Head merging: mean-pooling every two BC-heads and all corresponding components; (4) SSM-FLAP: adapting FLAP to SSMs, which prunes whole heads across all InProj sub-components together with their corresponding conv1d and OutProj components.
  • Figure 2: The effect of WANDA pruning ratios on different Mamba-2-2.7B components. The OutProj layer is substantially more sensitive to pruning than InProj.
  • Figure 3: Radar plots showing the effect of head dimension pruning ratios on Phi-Mamba-1.5B and HLM-3B across all benchmarks.
  • Figure 4: Radar plots showing the effect of WANDA pruning ratios on all four models across all benchmarks.
  • Figure 5: Radar plots showing the effect of head merging on all four models across all benchmarks.
  • ...and 5 more figures
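
The head merging step named in the Figure 1 caption (mean-pooling every two heads) can be illustrated with a small sketch. It assumes that "every two" means adjacent pairs and that each head is a flat vector of parameters; both are assumptions, and `merge_head_pairs` is a hypothetical helper rather than the authors' code.

```python
def merge_head_pairs(heads):
    """Mean-pool adjacent pairs of heads, halving the head count.

    heads: list of equal-length vectors, one per head.
    Assumes heads 0 and 1 are merged, then 2 and 3, and so on.
    """
    if len(heads) % 2 != 0:
        raise ValueError("expected an even number of heads")
    merged = []
    for first, second in zip(heads[0::2], heads[1::2]):
        # Element-wise mean of the two heads in the pair.
        merged.append([(a + b) / 2 for a, b in zip(first, second)])
    return merged

# Toy example (hypothetical values): four heads of dimension 2.
heads = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0], [4.0, 2.0]]
merged = merge_head_pairs(heads)
```

Per the caption, the same pooling would also have to be applied to all components tied to those heads; the sketch covers a single tensor only.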