Apriel-H1: Towards Efficient Enterprise Reasoning Models

Oleksiy Ostapenko; Luke Kumar; Raymond Li; Denis Kocetkov; Joel Lamy-Poirier; Shruthan Radhakrishna; Soham Parikh; Shambhavi Mishra; Sebastien Paquet; Srinivas Sunkara; Valérie Bécaert; Sathwik Tejaswi Madhusudhan; Torsten Scholak

TL;DR

The paper addresses the quadratic cost of attention in large transformers by introducing Apriel-H1, a family of 15B-parameter hybrid models that combine transformer attention with linear State Space Model (SSM) mixers (Mamba) and are distilled from a reasoning teacher. Using a staged reverse-KL distillation pipeline and layer-importance heuristics (LOO and MIL-MMR), the authors progressively replace multi-head attention (MHA) layers with Mamba blocks and fine-tune to preserve reasoning quality. The resulting hybrids reach up to around $3.4\times$ higher inference throughput with minimal degradation on reasoning benchmarks, and they remain competitive in reasoning capability, particularly when combined with supervised fine-tuning (SFT). The work argues that distillation-based hybridization is more practical than from-scratch pretraining because of its favorable data and compute balance, and it outlines future directions for more advanced linear mixers and adaptive layer scheduling to further improve the efficiency/accuracy frontier.
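To make the reverse-KL objective mentioned above concrete, here is a minimal sketch of the per-token quantity, $\mathrm{KL}(q_{\text{student}} \,\|\, p_{\text{teacher}})$, computed from raw logits. The function names `log_softmax` and `reverse_kl` are illustrative and do not come from the paper; the actual training pipeline presumably works on batched tensors and may differ in details such as masking or temperature.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position.

    The expectation is taken under the student distribution q, which is
    what distinguishes reverse KL from the forward KL(teacher || student).
    """
    log_q = log_softmax(student_logits)  # student log-probabilities
    log_p = log_softmax(teacher_logits)  # teacher log-probabilities
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


if __name__ == "__main__":
    student = [0.0, 0.0]   # hypothetical logits, for illustration only
    teacher = [2.0, 0.0]
    print("reverse KL:", reverse_kl(student, teacher))
    print("forward KL:", reverse_kl(teacher, student))
```

The two printed values differ because KL divergence is asymmetric. Reverse KL penalizes the student for placing probability mass where the teacher places little, which is commonly described as mode-seeking behavior.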

Abstract

Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.
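The memory argument in the abstract can be illustrated with a rough back-of-the-envelope sketch: the key-value cache of an attention layer grows linearly with sequence length, while the recurrent state of an SSM layer has a fixed size. All dimensions below (`NUM_KV_HEADS`, `HEAD_DIM`, `D_INNER`, `D_STATE`) are hypothetical placeholders and are not taken from the paper. The sketch also reads "30/50" as 30 Mamba layers out of 50 mixer layers, which is an assumption, and it ignores smaller terms such as Mamba's convolution state.

```python
# Hypothetical model dimensions, chosen only for illustration.
NUM_KV_HEADS = 8      # key/value heads per attention layer
HEAD_DIM = 128        # dimension of each head
D_INNER = 8192        # inner width of an SSM block
D_STATE = 16          # SSM state size per inner channel
BYTES_PER_ELEM = 2    # 16-bit storage


def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=BYTES_PER_ELEM):
    """Per-sequence KV cache size: keys and values for every cached token."""
    return 2 * num_attn_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(num_ssm_layers, d_inner, d_state,
                    bytes_per_elem=BYTES_PER_ELEM):
    """Per-sequence recurrent state size: independent of sequence length."""
    return num_ssm_layers * d_inner * d_state * bytes_per_elem


def per_sequence_bytes(seq_len, num_attn_layers, num_ssm_layers):
    """Total inference-time state for a hybrid stack of attention and SSM layers."""
    return (kv_cache_bytes(num_attn_layers, NUM_KV_HEADS, HEAD_DIM, seq_len)
            + ssm_state_bytes(num_ssm_layers, D_INNER, D_STATE))


if __name__ == "__main__":
    for seq_len in (1_024, 8_192, 65_536):
        full = per_sequence_bytes(seq_len, num_attn_layers=50, num_ssm_layers=0)
        hybrid = per_sequence_bytes(seq_len, num_attn_layers=20, num_ssm_layers=30)
        print(f"seq_len={seq_len:>6}: full={full / 2**20:8.1f} MiB, "
              f"hybrid={hybrid / 2**20:8.1f} MiB")
```

Under these assumed dimensions, the hybrid's state approaches 20/50 of the full transformer's as the sequence grows, since the SSM term stays constant. A smaller per-sequence state allows more concurrent sequences per device, which is one plausible route to the throughput gains the abstract reports.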
