Boomerang Distillation Enables Zero-Shot Model Size Interpolation

Sara Kangaslahti, Nihal V. Nayak, Jonathan Geuter, Marco Fumero, Francesco Locatello, David Alvarez-Melis

TL;DR

Boomerang distillation introduces a zero-shot interpolation mechanism that creates intermediate-size transformer models from a single teacher–student pair by patching contiguous teacher blocks into a distilled student. The method combines careful student initialization with cross-entropy, KL-divergence, and cosine-alignment losses, enabling the deterministic construction of models with sizes between the student and teacher. Empirically, interpolated models match or surpass intermediate models trained via standard distillation and outperform layer-pruning baselines, and the effect generalizes across Qwen, Pythia, Llama, and the off-the-shelf DistilBERT and DistilGPT2. This approach dramatically reduces training cost for fine-grained model families and provides a practical path for deploying models under diverse hardware constraints.

Abstract

Large language models (LLMs) are typically deployed under diverse memory and compute constraints. Existing approaches build model families by training each size independently, which is prohibitively expensive and provides only coarse-grained size options. In this work, we identify a novel phenomenon that we call boomerang distillation: starting from a large base model (the teacher), one first distills down to a small student and then progressively reconstructs intermediate-sized models by re-incorporating blocks of teacher layers into the student without any additional training. This process produces zero-shot interpolated models of many intermediate sizes whose performance scales smoothly between the student and teacher, often matching or surpassing pretrained or distilled models of the same size. We further analyze when this type of interpolation succeeds, showing that alignment between teacher and student through pruning and distillation is essential. Boomerang distillation thus provides a simple and efficient way to generate fine-grained model families, dramatically reducing training cost while enabling flexible adaptation across deployment environments. The code and models are available at https://github.com/dcml-lab/boomerang-distillation.
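The re-incorporation step described above can be illustrated with a minimal sketch. It assumes each student layer corresponds to one contiguous block of teacher layers; the function name `boomerang_interpolate`, the `blocks` mapping, and the string placeholders standing in for real transformer layers are all hypothetical and not taken from the released code.

```python
from typing import List, Sequence, Set, Tuple


def boomerang_interpolate(
    student_layers: Sequence[str],
    teacher_layers: Sequence[str],
    blocks: Sequence[Tuple[int, int]],
    patched: Set[int],
) -> List[str]:
    """Build an intermediate-size layer stack without any training.

    blocks[i] = (start, end) is the half-open range of teacher layers that
    student layer i is assumed to stand in for (hypothetical mapping).
    patched is the set of student layer indices to swap back for their
    teacher block.
    """
    if len(blocks) != len(student_layers):
        raise ValueError("need exactly one teacher block per student layer")
    stack: List[str] = []
    for i, (start, end) in enumerate(blocks):
        if i in patched:
            # Re-insert the whole contiguous teacher block.
            stack.extend(teacher_layers[start:end])
        else:
            # Keep the distilled student layer.
            stack.append(student_layers[i])
    return stack


# Toy example: an 8-layer teacher and a 4-layer student, where each
# student layer replaces two consecutive teacher layers.
teacher = [f"T{j}" for j in range(8)]
student = [f"S{i}" for i in range(4)]
blocks = [(0, 2), (2, 4), (4, 6), (6, 8)]

hybrid = boomerang_interpolate(student, teacher, blocks, patched={2, 3})
print(hybrid)  # ['S0', 'S1', 'T4', 'T5', 'T6', 'T7']
```

Patching no blocks returns the student and patching every block returns the teacher, so the size of `patched` sets where the result falls between the two. Which blocks to patch first is left to the caller in this sketch.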

Paper Structure

This paper contains 73 sections, 8 equations, 34 figures, 3 tables.

Figures (34)

  • Figure 1: Overview of boomerang distillation. ➀ In this example, the student model is initialized by dropping layers from the pretrained teacher model. ➁ The teacher model is distilled into the student model with cross-entropy loss, knowledge distillation loss, and cosine distance loss (Equation \ref{eq:overall-loss}). ➂ After training the student model, a block of teacher layers corresponding to a student layer is inserted back into the model to get the zero-shot interpolated model.
  • Figure 2: Boomerang distillation produces models with smooth size–performance interpolation, consistently outperforming naive layer pruning and interpolation from randomly initialized distilled models. These results indicate that effective interpolation depends on initializing the student with teacher weights and training under a knowledge distillation objective.
  • Figure 3: Boomerang distillation emerges across model families. Shown here for Qwen3-8B, Pythia-6.9B, and Llama-3.2-3B, boomerang interpolation yields intermediate models with smooth accuracy–parameter scaling, outperforming naive layer pruning and random interpolation baselines.
  • Figure 4: Interpolated models produced using boomerang distillation have comparable performance to pretrained and standard distilled models. We compare the interpolation performance of boomerang distillation to distilled models initialized with the corresponding teacher layers and distilled using Equation \ref{eq:overall-loss}. At small sizes, the interpolated models have comparable performance to distilled and pretrained models. At larger sizes, the interpolated models outperform distilled models, likely because the distilled models suffer catastrophic forgetting from being distilled on a presumably lower-quality corpus.
  • Figure 5: Per-layer loss yields stable and smoother interpolation performance. Models distilled with per-layer cosine distance loss have smoother interpolation behavior across all model sizes. However, boomerang distillation still occurs for models without per-layer cosine distance loss, indicating that initialization using teacher layers provides substantial alignment information.
  • ...and 29 more figures
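
The Figure 1 caption names three training terms: cross-entropy loss, knowledge distillation loss, and cosine distance loss. The sketch below shows one plausible way to combine them for a single token position, in plain Python. The exact form of the paper's overall-loss equation is not reproduced in this summary, so the weights `w_kd` and `w_cos`, the teacher-to-student KL direction, and the unweighted sum over layers are assumptions, as are all function names.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(student_logits: Sequence[float], target: int) -> float:
    """Negative log-probability the student assigns to the true token."""
    return -math.log(softmax(student_logits)[target])


def kl_divergence(teacher_logits: Sequence[float],
                  student_logits: Sequence[float]) -> float:
    """KL(teacher || student); the direction is an assumption."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distillation_loss(student_logits, teacher_logits, target,
                      student_hidden, teacher_hidden,
                      w_kd: float = 1.0, w_cos: float = 1.0) -> float:
    """Cross-entropy + weighted KL + weighted per-layer cosine distance.

    student_hidden[i] and teacher_hidden[i] are hidden-state vectors for
    aligned layers; w_kd and w_cos are hypothetical weights.
    """
    ce = cross_entropy(student_logits, target)
    kd = kl_divergence(teacher_logits, student_logits)
    cos = sum(cosine_distance(hs, ht)
              for hs, ht in zip(student_hidden, teacher_hidden))
    return ce + w_kd * kd + w_cos * cos


# Toy usage: a 3-token vocabulary and two aligned layers.
loss = distillation_loss(
    student_logits=[1.5, 0.5, 0.0],
    teacher_logits=[2.0, 1.0, 0.1],
    target=0,
    student_hidden=[[1.0, 0.0], [0.5, 0.5]],
    teacher_hidden=[[1.0, 0.1], [0.4, 0.6]],
)
print(f"loss = {loss:.4f}")
```

When the student's logits and hidden states equal the teacher's, the KL and cosine terms vanish and only the cross-entropy remains. In this sketch the cosine term is therefore the only part that penalizes layer-wise misalignment.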