GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection & Truncation

Michael Menezes, Anastasios Kyrillidis

TL;DR

GHOST tackles the memory bandwidth bottleneck in Mamba2 by structurally pruning the recurrent state dimension $N$ using forward-pass statistics. Its data-driven approach, inspired by balanced truncation, jointly measures controllability and observability to retain phantom states (highly active despite low weight norms) and discard corporeal ones (large weights but low utilization), without any gradient computation. Across models from 130M to 2.7B parameters evaluated on WikiText-2, GHOST achieves about 50% state reduction at a cost of roughly 1 perplexity point, while matching or outperforming gradient-based methods in robustness to sequence length, model scale, and out-of-distribution (OOD) scenarios. Lower memory bandwidth requirements, combined with stable performance across diverse tasks and calibration settings, make the method practical to deploy.

Abstract

While Mamba2's expanded state dimension enhances temporal modeling, it incurs substantial inference overhead that saturates memory bandwidth during autoregressive generation. Standard pruning methods fail to address this bottleneck: unstructured sparsity leaves activations dense, magnitude-based selection ignores runtime dynamics, and gradient-based methods impose prohibitive costs. We introduce GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics. By jointly measuring controllability and observability, GHOST rivals the fidelity of gradient-based methods without requiring backpropagation. As a highlight, on models ranging from 130M to 2.7B parameters, our approach achieves a 50% state-dimension reduction with an increase of approximately 1 perplexity point on WikiText-2. Code is available at https://anonymous.4open.science/r/mamba2_ghost-7BCB/.
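
For background, the classical balanced truncation that GHOST approximates can be sketched on a toy linear system. The snippet below is a textbook construction, not the paper's procedure: it assumes a stable, time-invariant, discrete-time system with a diagonal state matrix (Mamba2's dynamics are input-dependent, so this is only an analogy), and the function name and toy values are hypothetical.

```python
import numpy as np

def hankel_singular_values(a, B, C):
    """Hankel singular values of x[t+1] = diag(a) x[t] + B u[t], y[t] = C x[t].

    a: (N,) diagonal of the state matrix, assumed stable (|a_i| < 1)
    B: (N, m) input matrix, C: (p, N) output matrix
    """
    # For diagonal A the infinite-horizon Gramians have a closed form:
    # P_ij = (B B^T)_ij / (1 - a_i a_j),  Q_ij = (C^T C)_ij / (1 - a_i a_j)
    denom = 1.0 - np.outer(a, a)
    P = (B @ B.T) / denom  # controllability Gramian
    Q = (C.T @ C) / denom  # observability Gramian
    # Hankel singular values are the square roots of the eigenvalues of P Q.
    eigvals = np.linalg.eigvals(P @ Q).real
    hsv = np.sqrt(np.clip(eigvals, 0.0, None))
    return np.sort(hsv)[::-1], P, Q

# Toy system (illustrative values only): state 0 is weakly driven by the
# input, state 2 is weakly visible at the output.
a = np.array([0.9, 0.5, 0.1])
B = np.array([[0.01], [1.0], [1.0]])
C = np.array([[1.0, 1.0, 0.01]])

hsv, P, Q = hankel_singular_values(a, B, C)
print("Hankel singular values (descending):", hsv)
```

Small Hankel singular values mark directions in state space that are both hard to excite from the input and barely visible at the output, which is what balanced truncation removes. Neither Gramian alone captures this, which is the motivation for scoring controllability and observability jointly.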

Paper Structure (25 sections, 16 equations, 7 figures, 10 tables, 1 algorithm)

Figures (7)

  • Figure 1: The Proxy Failure: Static Magnitude vs. Dynamic Energy. We analyze the correlation between weight-based importance and actual runtime usage for Mamba2-1.3B. Each point represents one of the 6,144 hidden states ($48$ layers with $128$ states each). The $x$-axis depicts the standardized static score derived from projection weights ($W_{\text{score}} = \sqrt{\| (\bm{W}_{\bm{B}})_{g,i} \|_2 \| (\bm{W}_{\bm{C}})_{g,i} \|_2}$ for state $i$ in group $g$), while the $y$-axis tracks the standardized dynamic energy as in the salience equation (eq:salience). The lack of positive correlation reveals two critical failure modes: Phantom States (top-left), which are highly active despite low weight norms, and Corporeal States (bottom-right), which have large weights but low utilization. GHOST is designed to identify the former and prune the latter.
  • Figure 2: Overview of the Mamba2 forward pass with GHOST. The input $u'_t$ is projected into intermediate variables (Subfigure 1.) and discretized parameters (Subfigure 2.) to update the hidden state $\bm{H}_{t,h}$ (Subfigure 3.) and compute the final SSM output $\bm{y}^{\text{SSM}}_{t,h}$ (Subfigure 4.). Concurrently, the GHOST mechanism computes scores $\bm{S}_t^{(g)}$ and applies sorting and thresholding to generate a binary mask $\bm{m} \in \mathbb{R}^{G \cdot N}$ that induces sparsity in the initial projection (Subfigure 5.).
  • Figure 3: Time and Space Efficiency. (Left) FLOPs normalized by the computation required for a single forward pass. (Right) Peak VRAM requirements by method. Note that Taylor exceeds the memory capacity of 40 GB cards.
  • Figure 4: Length Robustness. Comparison of perplexity degradation when evaluating on progressively shorter contexts. We denote calibration-free methods with *.
  • Figure 5: Zero-Shot Performance. Accuracy on LAMBADA, PIQA, ARC-e, and ARC-c.
  • ...and 2 more figures
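
The static score from the Figure 1 caption and the sort-and-threshold masking step from the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight layout `(G, N, D)`, the per-group keep ratio, and the function names `static_score` and `topk_mask` are all hypothetical. GHOST itself ranks states by its dynamic scores $\bm{S}_t^{(g)}$, whose definition is not reproduced in this section; the static score is used here only because its formula is given, and Figure 1 shows it to be a poor proxy.

```python
import numpy as np

def static_score(W_B, W_C):
    """Weight-based score sqrt(||(W_B)_{g,i}||_2 * ||(W_C)_{g,i}||_2).

    W_B, W_C: arrays of shape (G, N, D) (assumed layout), where entry
    [g, i] holds the projection weights feeding state i of group g.
    Returns an array of shape (G, N).
    """
    norm_b = np.linalg.norm(W_B, axis=-1)
    norm_c = np.linalg.norm(W_C, axis=-1)
    return np.sqrt(norm_b * norm_c)

def topk_mask(scores, keep_ratio=0.5):
    """Binary mask of length G*N keeping the top-scoring states per group.

    scores: (G, N) importance scores; any scoring rule can be plugged in.
    """
    G, N = scores.shape
    k = max(1, int(round(keep_ratio * N)))
    order = np.argsort(-scores, axis=1)  # descending within each group
    mask = np.zeros((G, N), dtype=bool)
    np.put_along_axis(mask, order[:, :k], True, axis=1)
    return mask.reshape(G * N)

# Toy usage with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
G, N, D = 2, 8, 16
W_B = rng.normal(size=(G, N, D))
W_C = rng.normal(size=(G, N, D))

scores = static_score(W_B, W_C)
mask = topk_mask(scores, keep_ratio=0.5)
print("states kept per group:", mask.reshape(G, N).sum(axis=1))
```

Thresholding per group rather than globally keeps the pruned state dimension uniform across groups, which is one plausible reading of a structured mask of size $G \cdot N$; the paper may choose the threshold differently.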