GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection & Truncation
Michael Menezes, Anastasios Kyrillidis
TL;DR
GHOST tackles the memory-bandwidth bottleneck in Mamba2 by structurally pruning the recurrent state dimension $N$ using forward-pass statistics. Its data-driven approach, inspired by balanced truncation, jointly measures controllability and observability to identify and discard phantom states while retaining corporeal ones, with no gradient computation. On Mamba2 models from 130M to 2.7B parameters, GHOST removes about 50% of the state dimension at a cost of roughly 1 perplexity point on WikiText-2, while matching or outperforming gradient-based methods in robustness to sequence length, model scale, and out-of-distribution (OOD) scenarios. By lowering memory bandwidth requirements, the method makes deployment more practical and remains stable across diverse tasks and calibration settings.
Abstract
While Mamba2's expanded state dimension enhances temporal modeling, it incurs substantial inference overhead that saturates memory bandwidth during autoregressive generation. Standard pruning methods fail to address this bottleneck: unstructured sparsity leaves activations dense, magnitude-based selection ignores runtime dynamics, and gradient-based methods impose prohibitive costs. We introduce GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics. By jointly measuring controllability and observability, GHOST rivals the fidelity of gradient-based methods without requiring backpropagation. On models ranging from 130M to 2.7B parameters, our approach achieves a 50% state-dimension reduction with an increase of approximately 1 perplexity point on WikiText-2. Code is available at https://anonymous.4open.science/r/mamba2_ghost-7BCB/.
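To make the idea of forward-pass, balanced-truncation-style selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that calibration runs have recorded, for each group, the hidden state values `h` and the output-projection values `c` over `T` time steps, and it uses their mean squared magnitudes as diagonal stand-ins for the controllability and observability Gramians. The function names, array shapes, and the product score are all illustrative assumptions; the abstract does not specify GHOST's actual scoring rule.

```python
import numpy as np


def joint_state_scores(h, c):
    """Score each state channel from forward-pass statistics only.

    h, c: arrays of shape (T, G, N) holding hypothetical calibration
    records of the hidden state and the output projection, for T time
    steps, G groups, and N state channels per group.
    Returns an array of shape (G, N).
    """
    ctrl = np.mean(h ** 2, axis=0)  # controllability proxy: state energy
    obs = np.mean(c ** 2, axis=0)   # observability proxy: readout energy
    # Assumed score: product of the two diagonal proxies. Ranking by this
    # product matches ranking by its square root, which mirrors how
    # Hankel singular values combine both Gramians.
    return ctrl * obs


def select_states(scores, keep_ratio=0.5):
    """Return, per group, the sorted indices of the channels to keep."""
    _, n_states = scores.shape
    k = max(1, int(round(n_states * keep_ratio)))
    top = np.argsort(-scores, axis=1)[:, :k]  # highest scores first
    return np.sort(top, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(64, 2, 8))
    c = rng.normal(size=(64, 2, 8))
    h[:, :, :4] *= 1e-3  # make the first four channels barely excited
    print(select_states(joint_state_scores(h, c), keep_ratio=0.5))
```

Because the same index set is kept for every channel within a group, the pruned state remains a dense tensor with a smaller $N$. This is what distinguishes structured pruning from unstructured sparsity, which leaves activations dense.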
