GUIDE: Guided Initialization and Distillation of Embeddings

Khoa Trinh, Gaurav Menghani, Erik Vee

TL;DR

GUIDE addresses the inefficiency of standard distillation by introducing a zero-overhead, parameter-space initialization that bridges a large teacher and a smaller student. It uses PCA compression of the teacher's embeddings to initialize the student's embedding table and a learned projection ${\bm{M}} \in \mathbb{R}^{d_T \times d_S}$ to map inputs into the teacher-aligned space, followed by Uniform Selection to resize the remaining dimensions. Empirically, GUIDE reduces the teacher–student gap by about $26.5\%$ and $25.1\%$ for 400M and 1B students respectively, and combines near-additively with Knowledge Distillation for even larger gains, all without increasing training or inference costs. This makes GUIDE a practical way to use large pre-trained teachers to improve smaller models in deployment-sensitive settings.

Abstract

Algorithmic efficiency techniques such as distillation (Hinton et al., 2015) are useful for improving model quality without increasing serving costs, provided a larger teacher model is available for a smaller student model to learn from during training. Standard distillation methods only force the student to match the teacher's outputs. Given the costs associated with training a large model, we believe we should extract more useful information from a teacher model than output matching alone provides. In this paper, we introduce GUIDE (Guided Initialization and Distillation of Embeddings). GUIDE can be considered a distillation technique that forces the student to match the teacher in the parameter space. Using GUIDE, we show a 25-26% reduction in the teacher-student quality gap when using large student models (400M to 1B parameters) trained on $\approx$20B tokens. We also present a thorough analysis demonstrating that GUIDE can be combined with knowledge distillation with near-additive improvements. Furthermore, we show that applying GUIDE alone leads to substantially better model quality than applying knowledge distillation by itself. Most importantly, GUIDE introduces no training or inference overhead, so any model quality gains from our method are virtually free.

Paper Structure

This paper contains 15 sections, 10 equations, 6 figures, 8 tables, 2 algorithms.

Figures (6)

  • Figure 1: The teacher transformer can be "approximated" by using a PCA-compressed embedding table along with reconstructed embeddings.
  • Figure 2: GUIDE uses $M^T$ to transform $Q, K, V$ matrices. Then Uniform Selection is used to take evenly-spaced green rows and columns to initialize the student’s first block.
  • Figure 3: Training trajectory of the 400M parameter student model with GUIDE and other initialization methods. x-axis is the step number, and y-axis is the perplexity on the evaluation dataset.
  • Figure 4: Training trajectory of the 1B parameter student model with GUIDE and other initialization methods. x-axis is the step number, and y-axis is the perplexity on the evaluation dataset.
  • Figure 5: Combining Knowledge Distillation with GUIDE when training 400M model.
  • ...and 1 more figure
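
The captions for Figures 1 and 2 describe three ingredients: a PCA-compressed embedding table, a projection $M$ of shape $d_T \times d_S$ whose transpose transforms the teacher's $Q, K, V$ matrices, and Uniform Selection of evenly spaced rows and columns. The NumPy sketch below is one plausible reading of those steps, not the authors' implementation. All function names are hypothetical, and the mean-centering, the `x @ W` weight layout, and the floor-based spacing rule in `uniform_selection` are assumptions of this sketch.

```python
import numpy as np


def pca_projection(teacher_emb, d_student):
    """Fit PCA on a teacher embedding table of shape (vocab, d_T).

    Returns (mean, M) where M has shape (d_T, d_S) and its columns are the
    top d_S principal directions. Mean-centering is an assumption here.
    """
    mean = teacher_emb.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(teacher_emb - mean, full_matrices=False)
    return mean, vt[:d_student].T


def compress_embeddings(teacher_emb, mean, M):
    """Project teacher embeddings to d_S dims; used to initialize the student table."""
    return (teacher_emb - mean) @ M


def reconstruct_embeddings(student_emb, mean, M):
    """Map compressed embeddings back to the teacher's d_T-dimensional space."""
    return student_emb @ M.T + mean


def uniform_selection(weight, n_rows, n_cols):
    """Keep n_rows x n_cols evenly spaced entries of a 2-D weight matrix.

    The floor-based index rule is an assumed spacing scheme.
    """
    rows = np.arange(n_rows) * weight.shape[0] // n_rows
    cols = np.arange(n_cols) * weight.shape[1] // n_cols
    return weight[np.ix_(rows, cols)]


def init_student_projection(teacher_w, M, d_out_student):
    """Hypothetical init for one Q, K, or V matrix.

    Assumes the layer computes x @ W with W of shape (d_T, d_out). Since a
    student input x_s approximates x_t @ M, the teacher layer is approximated
    by x_s @ (M.T @ W). Uniform Selection then shrinks the output dimension.
    """
    projected = M.T @ teacher_w  # shape (d_S, d_out)
    return uniform_selection(projected, projected.shape[0], d_out_student)
```

As a usage example under the same assumptions, `mean, M = pca_projection(E_T, d_S)` followed by `compress_embeddings(E_T, mean, M)` would give a student embedding table of shape `(vocab, d_S)`, and `init_student_projection(W_Q, M, d_S)` would give a `(d_S, d_S)` starting point for the student's query matrix. Nothing here adds work at training or inference time, since all of it runs once before training starts.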