Compact Language Models via Pruning and Knowledge Distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

TL;DR

This work tackles the compute cost of producing LLMs at several deployment scales. It shows that pruning a single large model and retraining it on a small fraction (<3%) of the original training data, guided by knowledge distillation, can yield multiple smaller models with competitive accuracy. It introduces activation-based importance metrics and a lightweight neural architecture search to prune depth and width jointly, culminating in the Minitron family (8B and 4B) derived from Nemotron-4 15B. The approach needs up to 40x fewer training tokens per derived model and reduces the compute cost of training the full model family (15B, 8B, and 4B) by 1.8x, while outperforming several depth/width-pruned baselines and performing comparably to community models such as Mistral 7B, Gemma 7B and Llama-3 8B. The study also provides a concrete set of best practices and reports instruction-tuning results, with open-source weights and supplementary materials available. The practical effect is a lower data and compute requirement for producing LLMs of varied sizes, with competitive accuracy retained.

Abstract

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate whether pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare the performance of the compressed models to that of similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Hugging Face, with corresponding supplementary material including example code available on GitHub.

Paper Structure (24 sections, 8 equations, 9 figures, 20 tables)

Figures (9)

  • Figure 1: Results for Minitron. Compression results in significant reduction of training costs for additional models ($40\times$) while producing better results.
  • Figure 2: High-level overview of our proposed iterative pruning and distillation approach to train a family of smaller LLMs. On a pretrained LLM, we first evaluate the importance of neurons, rank them, trim the least important neurons, and distill the knowledge from the original LLM to the pruned model. The original model is replaced with the distilled model for the next iteration of compression.
  • Figure 3: Overview of our neural architecture search algorithm. We perform a search on multiple axes: number of layers, attention head count, MLP and embedding dimensions to arrive at a set of feasible architectures meeting the parameter budget. RT refers to retraining.
  • Figure 4: Overview of Distillation. A student model with $N$ layers is distilled from a teacher model with $M$ layers. The student learns by minimizing a combination of embedding output loss, logit loss and transformer-encoder-specific losses mapped across student block $S$ and teacher block $T$.
  • Figure 5: LM validation loss curve for retraining of two pruned candidates with (L2, L2) and (L2, Mean) metrics for (batch, sequence) aggregation strategies.
  • ...and 4 more figures
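
The pruning step named in the captions of Figures 2 and 5 (score neurons from activations, rank them, trim the least important) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `aggregate` and `prune_mlp_neurons`, the toy two-matrix MLP with a plain ReLU, the choice to reduce over the sequence axis before the batch axis, and the use of absolute values in the mean are all assumptions made for illustration.

```python
import numpy as np


def aggregate(acts, batch_agg="l2", seq_agg="mean"):
    """Reduce activations of shape (batch, seq, neurons) to one score per neuron.

    Assumption: the sequence axis is reduced first, then the batch axis.
    """
    reducers = {
        # Mean of absolute values, so positive and negative activations do not cancel.
        "mean": lambda x, axis: np.mean(np.abs(x), axis=axis),
        "l2": lambda x, axis: np.sqrt(np.sum(np.square(x), axis=axis)),
    }
    per_sample = reducers[seq_agg](acts, axis=1)    # (batch, neurons)
    return reducers[batch_agg](per_sample, axis=0)  # (neurons,)


def prune_mlp_neurons(w_in, w_out, scores, keep):
    """Keep the `keep` highest-scoring hidden neurons of a toy MLP.

    w_in has shape (hidden, neurons); w_out has shape (neurons, hidden).
    """
    top = np.sort(np.argsort(scores)[-keep:])  # kept indices, original order preserved
    return w_in[:, top], w_out[top, :]


# Hypothetical toy sizes and a random calibration batch.
rng = np.random.default_rng(0)
hidden, neurons = 8, 16
w_in = rng.normal(size=(hidden, neurons))
w_out = rng.normal(size=(neurons, hidden))
x = rng.normal(size=(4, 5, hidden))              # (batch, seq, hidden)

acts = np.maximum(x @ w_in, 0.0)                 # ReLU stand-in for the MLP activation
scores = aggregate(acts, batch_agg="l2", seq_agg="mean")
w_in_p, w_out_p = prune_mlp_neurons(w_in, w_out, scores, keep=8)
print(w_in_p.shape, w_out_p.shape)               # (8, 8) (8, 8)
```

Swapping the `batch_agg` and `seq_agg` arguments gives the other combinations of the kind compared in Figure 5, such as (L2, L2). In the approach described above, the pruned model would then be retrained with distillation from the original model, which this sketch does not cover.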