Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Roi Koren, Itay Levy, Pavlo Molchanov, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv

TL;DR

Puzzle tackles the inference-cost bottleneck of large language models by introducing a hardware-aware, decomposed NAS framework. It uses Blockwise Local Distillation to build a library of alternative blocks, then uses Mixed-Integer Programming to assemble non-uniform, hardware-tuned architectures under real-world constraints, followed by Global Knowledge Distillation to recover end-to-end performance. The approach yields Nemotron derivatives that deliver substantial throughput gains on a single GPU with minimal accuracy loss, and it demonstrates robustness across datasets, contexts, and hardware platforms. This work provides a practical pathway to deploying powerful LLMs efficiently, highlighting that inference efficiency, not merely parameter count, should guide model selection and design.

Abstract

Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also broadens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models derived from Llama-70B-Instruct. Both models achieve a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while retaining 98.4% of the original model's benchmark accuracies. These are the most accurate models supporting single H100 GPU inference with large batch sizes, despite training on 45B tokens at most, far fewer than the 15T used to train Llama-70B. Lastly, we show that lightweight alignment on these derived models allows them to surpass the parent model in specific capabilities. Our work establishes that powerful LLMs can be optimized for efficient deployment with only negligible loss in quality, underscoring that inference performance, not parameter count alone, should guide model selection.
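
The assembly step described above picks one block variant per layer under a deployment constraint. As a rough illustration only, the sketch below replaces the mixed-integer program with exhaustive search over a tiny, hypothetical block library. The block names, quality scores, costs, and the `select_blocks` helper are all assumptions made for this example, not values or APIs from the paper.

```python
from itertools import product

# Hypothetical toy library: one list per layer, each entry is
# (block name, quality score, cost). All numbers are made up for illustration.
LIBRARY = [
    [("full", 100, 4), ("slim", 90, 2), ("skip", 60, 0)],
    [("full", 100, 4), ("slim", 97, 2), ("skip", 92, 0)],
    [("full", 100, 4), ("slim", 80, 2), ("skip", 40, 0)],
]


def select_blocks(library, budget):
    """Pick one block per layer, maximizing summed quality under a cost budget.

    Exhaustive search stands in for a mixed-integer programming solver here;
    it is only practical for a handful of layers and variants.
    Returns (total quality, total cost, chosen block names), or None if no
    combination fits the budget.
    """
    best = None
    for choice in product(*library):
        cost = sum(block[2] for block in choice)
        if cost > budget:
            continue  # violates the (hypothetical) deployment constraint
        quality = sum(block[1] for block in choice)
        if best is None or quality > best[0]:
            best = (quality, cost, [block[0] for block in choice])
    return best


result = select_blocks(LIBRARY, budget=6)
print(result)
```

With these made-up scores, a budget of 6 selects a non-uniform architecture (a slim first layer, a skipped second layer, a full third layer) rather than three identical slim layers. Exhaustive search grows exponentially with the number of layers, which is why a dedicated solver is needed at realistic depth.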

Paper Structure

This paper contains 33 sections, 8 equations, 8 figures, 19 tables.

Figures (8)

  • Figure 1: An overview of the three stages of our Puzzle framework.
  • Figure 2: Blockwise local distillation (BLD): each block is trained in parallel and independently.
  • Figure 3: Coupled BLD requires training $|\mathcal{A}_i| \times |\mathcal{F}_i|$ variants per transformer layer, while decoupled BLD requires only $|\mathcal{A}_i| + |\mathcal{F}_i|$ variants per layer, significantly speeding up library construction.
  • Figure 4: Preference of human annotators in a blind test comparison. Results indicate comparable performance between Llama-3.1-70B-Instruct and Nemotron-51B.
  • Figure 5: Accuracy vs. Throughput performance of Nemotron-51B compared to state-of-the-art models. Throughput is measured on NVIDIA H100 GPUs with optimal TP setting per model, all running in FP8 on a "text generation" scenario (see Table \ref{tab:nemotron_throughput}). The red line represents the efficient frontier, highlighting models with the best accuracy-to-throughput tradeoff. Accuracy $= (\text{MT-Bench} \times 10 + \text{MMLU}) / 2$.
  • ...and 3 more figures
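
The arithmetic behind the Figure 3 and Figure 5 captions can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the input values in the usage lines (6 attention variants, 9 FFN variants, an MT-Bench score of 9.0, an MMLU score of 80.0) are made-up placeholders rather than numbers from the paper.

```python
def coupled_variants(num_attention, num_ffn):
    """Variants trained per layer when every attention/FFN pair is trained jointly."""
    return num_attention * num_ffn


def decoupled_variants(num_attention, num_ffn):
    """Variants trained per layer when attention and FFN subblocks are trained separately."""
    return num_attention + num_ffn


def composite_accuracy(mt_bench, mmlu):
    """Accuracy as defined in the Figure 5 caption: (MT-Bench x 10 + MMLU) / 2."""
    return (mt_bench * 10 + mmlu) / 2


# Placeholder sizes for |A_i| and |F_i|; not taken from the paper.
print(coupled_variants(6, 9))      # 54
print(decoupled_variants(6, 9))    # 15

# Placeholder benchmark scores; not taken from the paper.
print(composite_accuracy(9.0, 80.0))  # 85.0
```

Multiplying MT-Bench by 10 puts it on the same 0 to 100 scale as MMLU before averaging, and the product-versus-sum gap widens quickly as either variant set grows.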