On Teacher Hacking in Language Model Distillation

Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel

TL;DR

The paper formalizes teacher hacking in language model distillation by contrasting an oracle (ground-truth) distribution with a teacher and a student trained via soft distillation. It introduces golden (oracle–student) and proxy (teacher–student) metrics to separate alignment with the true objective from alignment with the proxy, and uses a controlled, semi-synthetic setup to measure when distillation drifts from the ground truth. Experiments show that a fixed offline dataset can trigger teacher hacking (the proxy metric improves while the golden metric degrades), whereas online data generation preserves or improves both metrics; data diversity and larger generation budgets further mitigate hacking. The work provides practical guidelines for distillation pipelines to build robust, efficient LMs and highlights the risk of proxy optimization when relying on imperfect teachers. Overall, it contributes a diagnostic framework and actionable strategies to reduce over-adaptation to imperfect proxies in post-training LM pipelines.

Abstract

Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. This phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.
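
The detection idea mentioned in the abstract (a deviation from polynomial convergence laws) can be illustrated with a minimal sketch. It assumes, as an interpretation rather than the paper's stated procedure, that the tracked metric follows a power law in the training step, so it is linear in log-log space. The function names, the tolerance `tol`, and the metric values are hypothetical and chosen only for illustration.

```python
import math

def fit_power_law(steps, values):
    """Least-squares fit of log(value) = a + b * log(step); returns (a, b)."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope /= sum((x - mean_x) ** 2 for x in xs)
    return mean_y - slope * mean_x, slope

def deviation_steps(steps, values, n_fit, tol=1.5):
    """Fit on the first n_fit points, then flag later steps whose value
    exceeds tol times the power-law extrapolation (hypothetical criterion)."""
    a, b = fit_power_law(steps[:n_fit], values[:n_fit])
    flagged = []
    for s, v in zip(steps[n_fit:], values[n_fit:]):
        predicted = math.exp(a + b * math.log(s))
        if v > tol * predicted:
            flagged.append(s)
    return flagged

# Made-up metric curve: follows 1/step for four points, then turns upward.
steps = [1, 2, 4, 8, 16, 32]
values = [1.0, 0.5, 0.25, 0.125, 0.2, 0.4]
flagged = deviation_steps(steps, values, n_fit=4)
print(flagged)
```

The choice of how many early points to fit and how large a tolerance to allow is a judgment call that would need tuning on real training curves.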

Paper Structure

This paper contains 41 sections, 4 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Overview of our controlled experimental setup. Usually, the teacher model is trained on expert data before being distilled into the student LM. In the controlled setup of this paper, the teacher is itself distilled from an additional oracle model. This oracle model allows us to measure the quality of the distillation process into the student, and to reveal "teacher hacking".
  • Figure 2: Overview of the training pipeline. Two stages: (1) prompts $x$ from a task-specific real dataset are used by the oracle model to generate the oracle pairs $(x,y)$, and this dataset is then used to obtain initial SFT checkpoints for both the teacher and student models; (2) prompts from the same distribution are used to perform knowledge distillation, where the teacher model serves as a proxy to train the student model.
  • Figure 3: Overview of the evaluation pipeline. We use the validation prompt dataset to measure the golden metric (the distance between the oracle and the student models) and the proxy metric (the distance between the teacher and the student models).
  • Figure 4: Proxy-Golden plot (offline data source). We distill a T5-large teacher into a T5-base student on the XSUM dataset. The token-level training loss is the forward KL, the proxy metric is the distance to the teacher distribution, and the golden metric is the distance to the ground-truth (oracle) distribution (available thanks to our semi-synthetic controlled experimental setup). In this plot, the $x$-axis (proxy metric) indicates optimization progress, and the $y$-axis shows the ground-truth performance (golden metric): lower is better. Teacher hacking occurs with the offline data source: the orange curve is U-shaped, indicating that during optimization the golden metric starts increasing while the proxy metric continues to decrease.
  • Figure 5: Impact of using offline vs. online data sources. When using a fixed offline dataset, the proxy metric continues to decrease, but the golden metric does not follow and instead increases, a phenomenon we call teacher hacking. However, when using online response sampling, whether from the teacher model or from the student model, this phenomenon does not occur.
  • ...and 8 more figures
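
As a rough illustration of the golden and proxy metrics described in the captions of Figures 3 and 4, the sketch below computes both on a toy example. It assumes each metric is a forward KL divergence with the reference model (oracle or teacher) as the first argument, which is one plausible reading of "distance" here. The three-token distributions are made up, and `forward_kl` is a hypothetical helper, not code from the paper.

```python
import math

def forward_kl(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical next-token distributions over a 3-token vocabulary.
oracle = [0.70, 0.20, 0.10]   # ground truth (only available in the controlled setup)
teacher = [0.60, 0.30, 0.10]  # imperfect approximation of the oracle
student = [0.55, 0.35, 0.10]  # model being distilled from the teacher

golden = forward_kl(oracle, student)   # distance to the ground truth
proxy = forward_kl(teacher, student)   # distance to the teacher
print(f"golden={golden:.4f} proxy={proxy:.4f}")
```

In this toy case the student is closer to the teacher than to the oracle, so the proxy value is smaller than the golden value. In practice the metrics would be averaged over sequences sampled from validation prompts rather than computed on a single distribution.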

Theorems & Definitions (1)

  • Definition 1: Teacher hacking
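
The formal statement of Definition 1 is not reproduced in this listing. Following the informal description above (the proxy metric improves while the golden metric degrades), a minimal sketch of a check on logged metric curves might look like this. The function name and the curve values are hypothetical, and a real check would likely smooth the curves first to avoid reacting to noise.

```python
def first_hacking_step(proxy, golden):
    """Return the first index t >= 1 where the proxy metric decreases while the
    golden metric increases relative to step t - 1, or None if that never happens."""
    for t in range(1, min(len(proxy), len(golden))):
        if proxy[t] < proxy[t - 1] and golden[t] > golden[t - 1]:
            return t
    return None

# Made-up curves (lower is better for both metrics).
proxy_curve = [0.50, 0.40, 0.30, 0.20, 0.10]
offline_golden = [0.60, 0.50, 0.45, 0.48, 0.55]  # U-shaped: starts rising at index 3
online_golden = [0.60, 0.50, 0.45, 0.42, 0.40]   # keeps decreasing

print(first_hacking_step(proxy_curve, offline_golden))  # 3
print(first_hacking_step(proxy_curve, online_golden))   # None
```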