Minitron-SSM: Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

TL;DR

This paper introduces Nemotron-H 4B, a hybrid language model combining Attention and State Space Model (SSM) layers, obtained by compressing Nemotron-H 8B with a novel group-aware pruning strategy and knowledge distillation (KD). The method jointly prunes Mamba heads, head channels, FFN neurons, embedding dimensions, and model depth, guided by activation-based and FLAP-based importance scores, and then recovers accuracy through KD. A combinatorial architecture search identifies 4B configurations that maximize accuracy and throughput, achieving up to a 40x reduction in training tokens while retaining >96% of the 8B model's performance and delivering ~2x faster inference. The authors demonstrate strong long-context capabilities (up to 128k tokens) and generalize the compression approach to Mamba2, underscoring its practical value for deploying efficient, high-performance hybrid LLMs. They also contribute an open, reproducible compression recipe to the community, enabling broader adoption of efficient hybrid architectures in resource-constrained settings.

Abstract

Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.

Paper Structure

This paper contains 24 sections, 18 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of Nemotron-H 4B model accuracy w.r.t. inference throughput (left), and training budget for the base model (right) to similarly-sized community models. Inference throughput is measured at an input and output sequence length of 65536 and 1024, respectively.
  • Figure 2: Overview of pruning and distillation for hybrid architectures. Starting from a pretrained LLM, we first evaluate the importance of Mamba heads and channels, FFN neurons, and embedding channels. We then rank them, trim the least important neurons, and distill the knowledge from the original LLM to the pruned model. Attention layers are not pruned since they amount to only 8% of the total number of layers.
  • Figure 3: Visualization of the Mamba group structure, showing broadcasting and the original $B_tx_t$ computation. Colors represent distinct entries. The figure illustrates that only within-group head permutations preserve SSM semantics. As a counterexample, if H3 and H8 were swapped, the resulting $B_tx_t$ would not be a permutation of the original (unpermuted) $B_tx_t$.
  • Figure 4: Layer importance measured as the KLD between logits of the full model and a model with that layer removed, averaged over a small training subset. Vertical dotted lines indicate layer types: self-attention (green), FFN (blue), and Mamba2 (red).
  • Figure 5: Accuracy drop relative to the 8B model across progressively depth-only pruned variants (48, 44, 40, 36, and 26 layers). Each model is directly pruned from the 8B and distilled using 126B tokens.
  • ...and 3 more figures
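
The group constraint described in the Figure 3 caption can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes 8 heads split into 2 groups of 4 (so that H3 and H8 fall in different groups), with $B_t$ stored once per group and broadcast to every head in that group. The helper names `bx_per_head` and `is_head_permutation`, and all sizes, are hypothetical and chosen only for illustration.

```python
import random

def bx_per_head(x, B, heads_per_group):
    """Per-head outer product of x_t and B_t, with B_t shared within each group."""
    out = []
    for h, x_h in enumerate(x):
        b_g = B[h // heads_per_group]  # group-level B_t broadcast to head h
        out.append(tuple(tuple(xi * bj for bj in b_g) for xi in x_h))
    return out

def is_head_permutation(a, b):
    """True if the per-head blocks of b are a reordering of the blocks of a."""
    return sorted(a) == sorted(b)

# Assumed toy sizes (not taken from the paper).
n_groups, heads_per_group, head_dim, d_state = 2, 4, 3, 5
n_heads = n_groups * heads_per_group

random.seed(0)
x = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_heads)]
B = [[random.gauss(0, 1) for _ in range(d_state)] for _ in range(n_groups)]

original = bx_per_head(x, B, heads_per_group)

# Reorder heads only inside their own group (0-indexed positions).
within = [1, 0, 3, 2, 5, 4, 7, 6]
# Swap H3 and H8 (0-indexed 2 and 7), which crosses the group boundary.
across = [0, 1, 7, 3, 4, 5, 6, 2]

within_ok = is_head_permutation(
    original, bx_per_head([x[i] for i in within], B, heads_per_group))
across_ok = is_head_permutation(
    original, bx_per_head([x[i] for i in across], B, heads_per_group))

print("within-group permutation preserves B_t x_t up to head order:", within_ok)
print("cross-group swap preserves B_t x_t up to head order:", across_ok)
```

Under these assumptions, the within-group reordering yields the same set of per-head blocks, whereas the cross-group swap pairs a head's $x_t$ with another group's $B_t$ and produces blocks that do not appear in the original.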