
In Good GRACEs: Principled Teacher Selection for Knowledge Distillation

Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel

TL;DR

This work tackles the costly problem of selecting the best teacher for distilling autoregressive language models. It introduces GRACE, a lightweight, verifier-free score that uses a gradient-based, cross-validated analysis of teacher-generated data to predict how well a given teacher will support a student, with theoretical ties to leave-one-out conditional mutual information. GRACE combines gradient-direction diversity and gradient-magnitude alignment (via a spectrum-weighted second-moment preconditioner) to identify strongly compatible teachers and provide actionable guidance on generation temperature, scale constraints, and model-family choices. Extensive experiments on GSM8K and MATH show that GRACE achieves high correlation with post-distillation performance and minimal teacher-student regret across multiple student and teacher families, outperforming prior gradient-based baselines. The approach offers a practical, generalizable tool for distillation, with promising results in out-of-distribution and non-math domains, and clear directions for future theoretical and empirical expansion.

Abstract

Knowledge distillation is an efficient strategy to use data generated by large "teacher" language models to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.

Paper Structure

This paper contains 53 sections, 4 theorems, 33 equations, 29 figures, 8 tables.

Key Result

Lemma 1

Define GRACE with $C = n$. For any $\mathcal{D}'$, take $\mathbf{M}(\mathcal{D}', \Theta) := \hat{\boldsymbol\Sigma}(\mathcal{D}')^{-1/2}$. Then $\mathrm{CMI} \lesssim \frac{1}{\sigma^2 n^2}\,\text{GRACE-Variance}(\mathcal{D}) \lesssim \frac{1}{\sigma^2 n^2}\,\text{GRACE}(\mathcal{D})$.
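
The preconditioner in the lemma, $\hat{\boldsymbol\Sigma}(\mathcal{D}')^{-1/2}$, can be illustrated with a small NumPy sketch. This is not the paper's GRACE score, whose full definition is not reproduced on this page. It only shows one way to form an inverse square root of an empirical second-moment matrix from gradients on a reference split and apply it to gradients from another split. The function names, the uncentered second moment, and the ridge term `eps` are assumptions made for illustration, and the dense eigendecomposition is only practical for a small gradient dimension.

```python
import numpy as np

def inv_sqrt_second_moment(grads, eps=1e-6):
    """Return (Sigma_hat + eps * I)^(-1/2), with Sigma_hat = (1/m) * sum_i g_i g_i^T.

    grads: array of shape (m, d), one gradient vector per row.
    eps:   hypothetical ridge term that keeps the inverse well defined.
    """
    grads = np.asarray(grads, dtype=float)
    m = grads.shape[0]
    sigma_hat = grads.T @ grads / m                 # (d, d), symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)    # eigendecomposition
    scale = 1.0 / np.sqrt(np.clip(eigvals, 0.0, None) + eps)
    return (eigvecs * scale) @ eigvecs.T            # V diag(scale) V^T

def mean_preconditioned_sq_norm(grads_eval, grads_ref, eps=1e-6):
    """Average squared norm of the rows of grads_eval after preconditioning
    with the inverse square root computed from grads_ref."""
    M = inv_sqrt_second_moment(grads_ref, eps)
    whitened = np.asarray(grads_eval, dtype=float) @ M  # M is symmetric
    return float(np.mean(np.sum(whitened ** 2, axis=1)))

# Toy usage with synthetic "gradients" (made-up data, not from the paper).
rng = np.random.default_rng(0)
g_ref = rng.normal(size=(50, 4))
g_eval = rng.normal(size=(20, 4))
value = mean_preconditioned_sq_norm(g_eval, g_ref)
```

When the evaluation and reference splits coincide and the second-moment matrix has full rank, the quantity is close to the gradient dimension $d$. This gives a quick sanity check that the preconditioner whitens the gradients.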

Figures (29)

  • Figure 1: GRACE correlates with student performance after distillation on math-related reasoning tasks. We evaluate LLaMA-1B and LLaMA-3B students on GSM8K and MATH respectively with 15 teachers across LLaMA, Gemma, Qwen, OLMo, and Phi families. (Left) GRACE shows the strongest correlation with student accuracy among four scores: teacher performance, student's loss (before training) on the teacher generations, G-Vendi, and GRACE. (Right) GRACE reliably selects near-optimal teachers within each teacher family, measured by its small teacher-student regret, which is the absolute gap in final performance between the best overall student and that obtained from the teacher chosen by each score. Performance is measured by average accuracy over 16 generations per prompt.
  • Figure 2: GRACE achieves $86\%$ Spearman correlation to LLaMA-1B's post-distillation performance on GSM8K, much higher than G-Norm ($53\%$) and G-Vendi ($44\%$). When evaluated by teacher-student regret, GRACE selects a teacher with regret of $0.3\%$, outperforming G-Norm and G-Vendi, which incur regrets of $10.8\%$ and $14.5\%$, respectively. Stars denote students trained from the teacher chosen by each score. Gemma teachers are outliers, because they give extremely concise responses to each prompt. More discussion is in the response-analysis appendix.
  • Figure 3: GRACE achieves $74\%$ Spearman correlation to OLMo-1B's post-distillation performance on GSM8K, significantly outperforming G-Norm ($41\%$) and G-Vendi ($48\%$). When evaluated by teacher-student regret, GRACE selects a teacher with regret of $0.1\%$, outperforming G-Norm and G-Vendi, which incur regrets of $9.1\%$ and $8.2\%$, respectively. Stars denote students trained from the teacher chosen by each score. Similar observations hold for a Gemma-2B student (reported in a separate figure).
  • Figure 4: GRACE achieves the strongest correlation to student performance, among all scores for LLaMA-1B on GSM8K and LLaMA-3B on MATH. Blue bars represent gradient-based scores, green bars denote student logit-based scores on the training data, and the gray bar corresponds to teacher performance. Teacher performance and the student's loss on teacher generations (Loss) before training show only weak correlations. While G-Norm correlates well with student performance on MATH, it is significantly worse on GSM8K.
  • Figure 5: GRACE achieves the minimum regret among all scores for LLaMA-1B on GSM8K and LLaMA-3B on MATH. Naively selecting the teacher with the best performance incurs a regret of at least $7.7\%$ and $5.9\%$ in the student's average-at-16 performance on GSM8K and MATH respectively. In contrast, GRACE achieves a regret of $0.3\%$ and $3.9\%$ on GSM8K and MATH respectively. In the left plot, the scores that show a regret of $14.5\%$ all select the same teacher, which is why their regret values are identical.
  • ...and 24 more figures
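
The two evaluation quantities used in the captions above, Spearman correlation and teacher-student regret, can be sketched as follows. Regret follows the Figure 1 caption: the gap between the best student accuracy over all teachers and the accuracy of the student trained on the teacher picked by a score. The function names, the `higher_is_better` flag, and the toy numbers are assumptions for illustration. This page does not state whether a larger or smaller score value marks the preferred teacher, so the direction is left as a parameter.

```python
import numpy as np

def average_ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(scores, accuracies):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(scores), average_ranks(accuracies))[0, 1])

def teacher_student_regret(scores, accuracies, higher_is_better=True):
    """Best achievable student accuracy minus the accuracy of the student
    distilled from the teacher that the score selects."""
    scores = np.asarray(scores, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    chosen = int(np.argmax(scores) if higher_is_better else np.argmin(scores))
    return float(accuracies.max() - accuracies[chosen])

# Hypothetical scores and student accuracies (%) for four teachers.
toy_scores = [0.2, 0.5, 0.9, 0.4]
toy_accuracies = [30.0, 41.0, 48.0, 50.0]
rho = spearman(toy_scores, toy_accuracies)                   # 0.4
regret = teacher_student_regret(toy_scores, toy_accuracies)  # 2.0
```

In the toy data, the score picks the third teacher (48.0) while the best student comes from the fourth (50.0), so the regret is 2.0 accuracy points even though the rank correlation is positive.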

Theorems & Definitions (5)

  • Lemma 1: Informal; cf. the preconditioning corollary
  • Lemma 2: Bounds for Pre-conditioned Gradient Descent
  • Corollary C.1: Connecting CMI to G-Norm
  • Corollary C.2: Connecting CMI to GRACE
  • proof