Apriel-H1: Towards Efficient Enterprise Reasoning Models

Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra, Sebastien Paquet, Srinivas Sunkara, Valérie Bécaert, Sathwik Tejaswi Madhusudhan, Torsten Scholak

TL;DR

The paper tackles the bottleneck of quadratic attention in large transformers by introducing Apriel-H1 hybrids that fuse transformer attention with linear State Space Model (SSM) mixers (Mamba) at 15B, distilled from a reasoning teacher. Through a staged, reverse-KL distillation pipeline and layer-importance heuristics (LOO and MIL-MMR), the authors progressively replace MHA layers with Mamba blocks and fine-tune to preserve reasoning quality. The resulting hybrids achieve substantial inference throughput gains—up to around $3.4\times$ higher throughput with minimal degradation on reasoning benchmarks—while maintaining competitive reasoning capabilities, particularly when combined with supervised fine-tuning (SFT). The work highlights the practicality of distillation-based hybridization over from-scratch pretraining due to its favorable data and compute balance, and outlines future directions for more advanced linear mixers and adaptive layer scheduling to further improve the efficiency/accuracy frontier.
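To make the "reverse-KL" term concrete, here is a minimal sketch of the divergence for a single token position. It is a generic textbook definition, not the authors' training code: the function names and the toy logits are assumptions made for illustration, and the summary above does not specify temperature, masking, or any loss weighting the paper may use.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(s) * (s - t) for s, t in zip(ls, lt))

def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student), shown only for contrast."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(t) * (t - s) for s, t in zip(ls, lt))

if __name__ == "__main__":
    # Hypothetical logits over a two-token vocabulary.
    student = [0.0, 0.0]   # uniform student
    teacher = [0.0, 5.0]   # teacher strongly prefers token 1
    print("reverse KL:", reverse_kl(student, teacher))
    print("forward KL:", forward_kl(student, teacher))
```

The two directions give different values on the same pair of distributions. Reverse KL weights the mismatch by the student's own probabilities, so it penalizes the student most for placing mass where the teacher assigns little.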

Abstract

Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.
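The memory argument in the abstract can be illustrated with back-of-the-envelope arithmetic: a key-value cache grows with sequence length, while a recurrent SSM state does not. The sketch below uses hypothetical dimensions (layer counts, head sizes, state sizes) chosen only for illustration; they are not the configuration of any Apriel model, and the SSM formula ignores smaller terms such as the convolution state.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: one key and one value vector per token per layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_ssm_layers, d_inner, d_state, bytes_per_elem=2):
    """Per-sequence recurrent state size: fixed, independent of sequence length."""
    return n_ssm_layers * d_inner * d_state * bytes_per_elem

def hybrid_bytes(n_attn_layers, n_ssm_layers, seq_len,
                 n_kv_heads=8, head_dim=128, d_inner=8192, d_state=16):
    """Total cache for a hybrid stack; all default sizes are illustrative assumptions."""
    return (kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len)
            + ssm_state_bytes(n_ssm_layers, d_inner, d_state))

if __name__ == "__main__":
    for seq_len in (1_024, 16_384, 131_072):
        full = hybrid_bytes(n_attn_layers=50, n_ssm_layers=0, seq_len=seq_len)
        mixed = hybrid_bytes(n_attn_layers=20, n_ssm_layers=30, seq_len=seq_len)
        print(f"{seq_len:>7} tokens: attention-only {full / 2**20:9.1f} MiB, "
              f"hybrid {mixed / 2**20:9.1f} MiB")
```

Under these assumed sizes, the attention-only cache scales linearly with the number of generated tokens, while replacing attention layers with SSM layers swaps part of that growth for a constant term. That is the mechanism behind the larger batch sizes and higher throughput a hybrid can sustain at long outputs.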

Paper Structure

This paper contains 14 sections, 5 equations, 5 figures.

Figures (5)

  • Figure 1: Layer importance ($\uparrow$) using LOO for the Apriel-Nemotron-15B-Thinker model [Apriel-nemotron-15b-thinker].
  • Figure 2: Layer importance MMR ($\uparrow$) before distillation (0 steps) and after 100 distillation steps. Crossing horizontal lines visualize the change in layer importance ranking.
  • Figure 3: (left) Comparison of evaluation metrics between Apriel-Nemotron-15b-Thinker and Apriel-H1-30/50-15b-Thinker-SFT. (right) The H variant more than doubles the throughput with minimal drop in performance across a wide range of tasks under a typical reasoning load using the vLLM back-end.
  • Figure 4: Performance vs. throughput trade-off for Apriel-H1-15B-Thinker variants (H1-25 to H1-40), Apriel-15B-Thinker (transformer), and other open-weights hybrids, measured using the vLLM backend. Below each Apriel-H1 model we report the total number of tokens used to obtain it from Apriel-15B-Thinker. $\textcolor{red}{^*}$ The H1-27 model has fewer tokens than the H1-25 model because it started from an earlier H1-25 checkpoint than the one plotted here for H1-25. Apriel-H1-30/50-Thinker-SFT (H1-30-SFT) is a version of the post-distillation H1-30 model further fine-tuned on a dataset of high-quality reasoning traces and merged with a version of the H1-30 model (after 55.9B tokens of distillation). The linear decay in model performance as the number of SSM mixers increases highlights a smooth trade-off between speed and performance. We also include the most recent Nemotron-Nano-9B-v2 model [basant2025nvidia], which dominates the Pareto frontier in this plot. Importantly, this model was obtained through direct pre-training of a 12B hybrid (20T tokens) on high-quality data, followed by post-training with SFT and GRPO phases and an additional pruning phase in which the model was pruned from 12B to 9B. We do not include other existing SSM-Transformer hybrid reasoners such as M1 [wang2025m1] and Jet-Nemotron [gu2025jet] because of their small size and the absence of an implementation in the recent vLLM version. We also did not include other models that we could not fit on a single H100 GPU for inference in the scenario with 1 input token and 16k output tokens.
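Figure 1 ranks layers by leave-one-out (LOO) importance. As a rough illustration of the idea, the sketch below skips one layer at a time in a toy stack and measures how far the output drifts from the full model. This is one common reading of LOO and not the authors' procedure: the scalar "layers", the mean-squared-drift proxy, and all names are assumptions, whereas the paper evaluates a real transformer with its own metric.

```python
from typing import Callable, List, Optional, Sequence

Layer = Callable[[float], float]

def run(layers: Sequence[Layer], x: float, skip: Optional[int] = None) -> float:
    """Apply layers in order; the layer at index `skip` is treated as identity."""
    for i, layer in enumerate(layers):
        if i == skip:
            continue
        x = layer(x)
    return x

def loo_importance(layers: Sequence[Layer], inputs: Sequence[float]) -> List[float]:
    """Importance of layer i = mean squared output drift when layer i is skipped."""
    reference = [run(layers, x) for x in inputs]
    scores = []
    for i in range(len(layers)):
        ablated = [run(layers, x, skip=i) for x in inputs]
        drift = sum((a - r) ** 2 for a, r in zip(ablated, reference)) / len(inputs)
        scores.append(drift)
    return scores

def add_one(x: float) -> float:
    return x + 1.0

def no_op(x: float) -> float:
    return x

def double(x: float) -> float:
    return x * 2.0

if __name__ == "__main__":
    toy_layers = [add_one, no_op, double]      # hypothetical three-layer stack
    scores = loo_importance(toy_layers, [1.0, 2.0, 3.0])
    # Layers with the lowest score are the first candidates for replacement.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    print("importance per layer:", scores)
    print("replacement order (least important first):", order)
```

In the hybridization setting described above, the least important attention layers would be the first to be swapped for Mamba blocks. Figure 2 suggests that the ranking can shift after some distillation, so a single static ranking is only a starting point.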