Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

TL;DR

Nemotron Elastic introduces the first elastic, reasoning-focused framework that learns a single parent model capable of yielding multiple nested sub-models without retraining. By coupling a router-guided architecture search with a two-stage extended-context curriculum, it achieves substantial token and deployment-memory savings while delivering competitive or superior reasoning performance across 6B, 9B, and 12B budgets extracted zero-shot from a 12B Nemotron Nano V2 parent. The approach relies on a hybrid Mamba-Attention backbone and employs end-to-end router learning, group-aware elastification, and knowledge distillation to balance accuracy and resource constraints. Practically, this yields a scalable, memory-efficient way to deploy diverse inference configurations from one training run, enabling flexible edge and cloud deployments with minimal overhead and cost.

Abstract

Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring a separate training run for each size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par with or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested structure of our approach yields a many-in-one reasoning model whose deployment memory stays constant regardless of the number of models in the family.
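The nesting idea in the abstract (submodels that share the parent's weights and are extracted without further training) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the names `make_parent`, `extract_submodel`, and `BUDGETS`, the 8x8 matrix, and the choice to keep leading rows and columns. The actual method selects components with a learned router and importance scores, which this sketch does not model.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def make_parent(rows: int, cols: int) -> Matrix:
    """Build a toy parent weight matrix with distinct entries."""
    return [[float(r * cols + c) for c in range(cols)] for r in range(rows)]


def extract_submodel(parent: Matrix, rows: int, cols: int) -> Matrix:
    """Slice the leading rows/cols of the parent; no training step is involved."""
    if rows > len(parent) or cols > len(parent[0]):
        raise ValueError("budget exceeds parent dimensions")
    return [row[:cols] for row in parent[:rows]]


# Hypothetical budgets: (rows, cols) kept for each nested variant.
BUDGETS: Dict[str, Tuple[int, int]] = {
    "large": (8, 8),
    "medium": (6, 6),
    "small": (4, 4),
}

parent = make_parent(8, 8)
family = {name: extract_submodel(parent, r, c) for name, (r, c) in BUDGETS.items()}

for name, weights in family.items():
    print(name, len(weights), "x", len(weights[0]))
```

Because every variant is a slice of the same matrix, the small variant is also a slice of the medium one, which is what "nested" means in this toy setting; only the parent needs to be stored.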

Paper Structure

This paper contains 60 sections, 58 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Left: Accuracy across key reasoning and mathematical benchmarks. The accuracy shown is the average across all benchmarks: MATH-500, AIME-2024, AIME-2025, GPQA, LiveCodeBench v5, and MMLU-Pro. Right: Scaling analysis comparing Nemotron Elastic and Minitron-SSM as model family size grows. Nemotron Elastic maintains constant cost for tokens and deployment memory, while Minitron-SSM scales linearly.
  • Figure 2: Overview of the Nemotron Elastic training and deployment pipeline. Training: For each training sample, data flows to both teacher and student models. A budget (parameter size: 6B, 9B, or 12B) is selected and passed to the router, which generates differentiable masks for the student model. Knowledge distillation from the model prior to elastification enables simultaneous optimization across all budget variants. Deployment: After training, all models are extracted zero-shot from a single elastic checkpoint: the full 12B model and nested sub-networks (9B and 6B) are immediately available without additional fine-tuning or re-training.
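
The scaling comparison in the Figure 1 caption (constant cost for the elastic model, linear cost for per-model compression) can be written as a small cost model. This is a rough sketch under stated assumptions, not the paper's accounting: the function names are hypothetical, memory is counted in billions of parameters and ignores router overhead, and `ASSUMED_TOKENS_PER_MODEL_B` is an illustrative placeholder rather than a figure from the paper. Only the 110B token budget and the 12B/9B/6B sizes come from the text above.

```python
from typing import List

# From the abstract: one elastic run of 110B tokens covers all nested models.
ELASTIC_TOKENS_B = 110.0

# Illustrative placeholder only (NOT from the paper): tokens spent per
# separately compressed model, in billions.
ASSUMED_TOKENS_PER_MODEL_B = 300.0


def elastic_tokens(num_models: int) -> float:
    """Token cost of the elastic run: flat in the number of nested models."""
    return ELASTIC_TOKENS_B if num_models > 0 else 0.0


def per_model_tokens(num_models: int, tokens_per_model_b: float) -> float:
    """Token cost when every compressed model needs its own run: linear."""
    return num_models * tokens_per_model_b


def elastic_memory(sizes_b: List[float]) -> float:
    """Nested models share weights, so only the largest one is stored."""
    return max(sizes_b)


def separate_memory(sizes_b: List[float]) -> float:
    """Independent checkpoints are stored side by side."""
    return sum(sizes_b)


family_sizes_b = [12.0, 9.0, 6.0]  # model sizes named in the source
print("elastic memory (B params):", elastic_memory(family_sizes_b))
print("separate memory (B params):", separate_memory(family_sizes_b))

for n in (1, 2, 4):
    print(n, elastic_tokens(n), per_model_tokens(n, ASSUMED_TOKENS_PER_MODEL_B))
```

For the three sizes named in the source, storing the models separately takes 27B parameters in total, while the nested checkpoint holds only the 12B parent.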