Knowledge Distillation with Training Wheels

Guanlin Liu, Anand Ramachandran, Tanmay Gangwani, Yan Fu, Abhinav Sethy

TL;DR

Knowledge Distillation with Training Wheels reframes KD as an entropy-regularized value-optimization problem solved via Path Consistency Learning, enabling a student to request teacher help at test time under budgeted constraints. The authors introduce a constrained reinforcement learning extension with a special decoding token $<\tau>$ and natural-language budget prompts, optimizing a secondary constraint $V_C$ alongside the primary value function $V^{\pi}$. Empirical results on translation and summarization show that this approach can satisfy teacher-use budgets while improving output quality and reducing latency, unveiling operating points unavailable to prior speculative decoding methods. The framework generalizes to integrating multiple sources of expertise and tools, offering a scalable way to blend automated generation with selective expert guidance.

Abstract

Knowledge distillation is used in generative language modeling to train a smaller student model with the help of a larger teacher model, improving the student's capabilities. In this paper, we formulate a more general framework for knowledge distillation in which the student learns from the teacher during training and also learns to ask for the teacher's help at test time, following rules that specify test-time restrictions. To this end, we first formulate knowledge distillation as an entropy-regularized value optimization problem. Adopting Path Consistency Learning to solve this problem leads to a new knowledge distillation algorithm that uses on-policy and off-policy demonstrations. We extend this algorithm, using constrained reinforcement learning, to a framework that incorporates the teacher model as a test-time reference, within constraints. In this setting, akin to a human learner, the model must learn not only the learning material but also the relative difficulty of its different sections, so that it can prioritize where to seek teacher help. We examine the efficacy of our method through experiments on translation and summarization tasks, observing trends in accuracy and teacher use, and note that our approach unlocks operating points not available to the popular Speculative Decoding approach.
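
To make the Path Consistency Learning idea mentioned above more concrete, the following is a minimal sketch of the generic multi-step path-consistency residual from the original PCL formulation for entropy-regularized RL. It is not taken from this paper: the exact objective, reward definition, and handling of the teacher-request token are not given in this excerpt, so the function name, argument names, and the toy numbers are all illustrative assumptions. The entropy weight is called `temperature` here to avoid confusion with the paper's special decoding token.

```python
import math


def path_consistency_residual(values, rewards, log_probs,
                              temperature=1.0, gamma=1.0):
    """Generic d-step path-consistency residual (hypothetical helper).

    values:    V(s_0), ..., V(s_d)            (length d + 1)
    rewards:   r_0, ..., r_{d-1}              (length d)
    log_probs: log pi(a_j | s_j), j = 0..d-1  (length d)

    The residual is
        -V(s_0) + gamma^d * V(s_d)
        + sum_j gamma^j * (r_j - temperature * log pi(a_j | s_j)),
    which is zero along every path when the value and policy are the
    optimal entropy-regularized pair. PCL-style training minimizes the
    squared residual over sampled (on- or off-policy) paths.
    """
    d = len(rewards)
    if len(values) != d + 1 or len(log_probs) != d:
        raise ValueError("expected len(values) == len(rewards) + 1 "
                         "and len(log_probs) == len(rewards)")
    soft_return = sum(
        gamma ** j * (rewards[j] - temperature * log_probs[j])
        for j in range(d)
    )
    return -values[0] + gamma ** d * values[d] + soft_return


# Toy check (assumed setup): one step, two actions, zero reward, terminal
# value 0. The optimal soft policy is uniform and V(s_0) = log 2, so the
# residual should vanish.
values = [math.log(2.0), 0.0]
rewards = [0.0]
log_probs = [math.log(0.5)]
residual = path_consistency_residual(values, rewards, log_probs)
squared_loss = 0.5 * residual ** 2
```

Because the residual is defined for any path, the same quantity can be evaluated on student-generated sequences and on teacher demonstrations alike, which is what makes a consistency-based objective compatible with both on-policy and off-policy data.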

Paper Structure

This paper contains 19 sections, 17 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: Performance metrics for the translation task. The X-axis indicates output quality. The Y-axis represents the achieved latency.
  • Figure 2: Performance metrics for the summarization task. The X-axis indicates output quality. The Y-axis represents the achieved latency.