Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai

TL;DR

Jet-Nemotron addresses the efficiency-accuracy trade-off in large language models by proposing PostNAS, a post-training architecture search that freezes the MLPs of a pre-trained full-attention base and optimizes attention blocks. It introduces JetBlock, a dynamic, input-conditioned linear attention block, and uses hardware-aware NAS to tune KV cache size, head count, and dimension settings, achieving substantial throughput gains with minimal accuracy loss. Across MMLU(-Pro), math, retrieval, coding, and long-context benchmarks, Jet-Nemotron-2B matches or exceeds state-of-the-art full-attention models while delivering up to $53.6\times$ generation throughput on NVIDIA H100 at 256K context. By leveraging pre-trained bases and a coarse-to-fine search, PostNAS reduces the cost and risk of architectural exploration, enabling rapid deployment of efficient, high-performing language systems.

Abstract

We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
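The link between the number of full-attention layers and generation cost can be illustrated with a back-of-the-envelope KV cache calculation. The sketch below is not from the paper: the function name `kv_cache_bytes` and every configuration value (layer counts, KV heads, head dimension) are hypothetical, chosen only to show how the cache scales. It assumes linear-attention layers keep a fixed-size state that does not grow with context length, so only full-attention layers are counted.

```python
def kv_cache_bytes(
    num_full_attn_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # assumes 16-bit cache entries
) -> int:
    """Bytes needed to cache keys and values for all full-attention layers."""
    # The leading factor of 2 accounts for storing both keys and values.
    return (
        2
        * num_full_attn_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_elem
    )


# Hypothetical configurations (illustrative only, not Jet-Nemotron's real settings).
context_len = 64 * 1024
full_attention = kv_cache_bytes(
    num_full_attn_layers=28, num_kv_heads=2, head_dim=128, seq_len=context_len
)
hybrid = kv_cache_bytes(
    num_full_attn_layers=2, num_kv_heads=2, head_dim=128, seq_len=context_len
)

print(f"full attention: {full_attention / 2**20:.0f} MiB")
print(f"hybrid:         {hybrid / 2**20:.0f} MiB")
print(f"cache reduction: {full_attention // hybrid}x")
```

Under these assumed values the cache shrinks in proportion to the number of full-attention layers kept. A smaller cache allows larger batches and less memory traffic per decoded token, but the ratio printed here is a memory ratio, not a throughput measurement.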

Paper Structure

This paper contains 21 sections, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Comparison Between Jet-Nemotron and State-of-the-Art Efficient Language Models. The generation throughput is measured on the NVIDIA H100 GPU under a context length of 64K tokens. Jet-Nemotron-2B delivers higher accuracy than Qwen3-1.7B-Base on MMLU-Pro while achieving 47$\times$ higher generation throughput. Jet-Nemotron-4B, despite its larger model size, still achieves higher generation throughput than all full-attention models with fewer than 2B parameters.
  • Figure 2: PostNAS Roadmap. Our pipeline starts from a pre-trained full-attention model and keeps the MLP frozen. It then performs a coarse-to-fine search for efficient attention block designs, first determining the optimal placement of full-attention layers, then selecting the best linear attention block or using a new linear attention block, and finally searching for optimal architectural hyperparameters.
  • Figure 3: PostNAS Accuracy Improvement Breakdown. By applying PostNAS to the baseline model, we achieve significant accuracy improvements across all benchmarks.
  • Figure 4: Learning to Place Full Attention with PostNAS. We train a once-for-all super network and perform beam search to identify the optimal placement of full attention layers.
  • Figure 5: (a) Layer Placement Search Results on Qwen2.5-1.5B. Each grid cell represents the search objective value of the corresponding attention layer; higher values indicate greater importance. (b) Comparison Between PostNAS and Uniform Placement.
  • ...and 1 more figure
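
The beam search over full-attention placements mentioned in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `beam_search_placement`, the `score` callback, and the per-layer importance values are all hypothetical. In the paper's setting the score would come from evaluating a sub-network of the once-for-all super network; here it is replaced by a toy additive score so the example runs on its own.

```python
from typing import Callable, FrozenSet, List


def beam_search_placement(
    num_layers: int,
    num_full: int,
    score: Callable[[FrozenSet[int]], float],
    beam_width: int = 3,
) -> List[int]:
    """Pick `num_full` layer indices to keep as full attention via beam search."""
    beam: List[FrozenSet[int]] = [frozenset()]
    for _ in range(num_full):
        # Extend every placement in the beam by one not-yet-selected layer.
        candidates = {
            placement | {layer}
            for placement in beam
            for layer in range(num_layers)
            if layer not in placement
        }
        # Keep the highest-scoring placements; ties are broken by layer indices
        # so the result is deterministic.
        beam = sorted(candidates, key=lambda p: (-score(p), sorted(p)))[:beam_width]
    return sorted(beam[0])


# Toy stand-in for the search objective: made-up per-layer importance values.
toy_importance = [0.1, 0.9, 0.2, 0.7, 0.3, 0.8]


def toy_score(placement: FrozenSet[int]) -> float:
    return sum(toy_importance[i] for i in placement)


best = beam_search_placement(num_layers=6, num_full=2, score=toy_score)
print(best)  # [1, 5]
```

With an additive toy score the search simply recovers the top-scoring layers. Beam search matters when the objective is not additive, as when each candidate placement has to be evaluated as a whole, because it keeps several partial placements alive instead of committing greedily to one.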