Autoregressive Knowledge Distillation through Imitation Learning

Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei

TL;DR

The paper addresses the difficulty of deploying high-performing but slow autoregressive NLG models. It frames knowledge distillation (KD) as an imitation learning problem, drawing on strategies such as behavioral cloning and DAgger, and designs the distillation algorithm to address exposure bias. Students trained with the method score 1.4 to 4.8 BLEU/ROUGE points higher than students trained from scratch and run up to 14x faster than the teacher. On translation and summarization, the method also outperforms other distillation algorithms such as sequence-level knowledge distillation (seqKD). The result is a practical route to deploying strong NLG models in time-sensitive settings.

Abstract

The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of hindering inference speed, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.
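Exposure bias refers to the mismatch between training, where a model conditions on reference prefixes, and inference, where it conditions on its own earlier predictions. The following is a minimal, generic DAgger-style sketch of how training examples could be collected to reduce that mismatch. It is not the paper's algorithm: the toy vocabulary, the `teacher_next` and `student_next` policies, the per-sequence mixing probability `beta`, and the function `collect_examples` are all hypothetical names introduced only for illustration.

```python
import random

VOCAB = ["a", "b", "<eos>"]  # toy vocabulary (assumption, for illustration only)


def teacher_next(prefix):
    """Hypothetical teacher: next-token distribution given a prefix."""
    if len(prefix) >= 3:
        return {"a": 0.05, "b": 0.05, "<eos>": 0.9}
    return {"a": 0.6, "b": 0.3, "<eos>": 0.1}


def student_next(prefix):
    """Hypothetical untrained student: uniform over the vocabulary."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def sample(dist, rng):
    """Draw one token from a {token: probability} dictionary."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def collect_examples(beta, max_len, rng):
    """Roll out one sequence and label every visited prefix with the teacher.

    With probability beta the teacher generates the sequence (the
    behavioral-cloning-like case); otherwise the student generates it, so the
    teacher's labels cover prefixes the student actually produces.
    """
    policy = teacher_next if rng.random() < beta else student_next
    prefix, examples = [], []
    for _ in range(max_len):
        # The training target is always the teacher's distribution.
        examples.append((tuple(prefix), teacher_next(prefix)))
        token = sample(policy(prefix), rng)
        if token == "<eos>":
            break
        prefix.append(token)
    return examples


if __name__ == "__main__":
    rng = random.Random(0)
    for prefix, target in collect_examples(beta=0.5, max_len=5, rng=rng):
        print(prefix, target)
```

Setting `beta=1.0` in this sketch collects labels only on teacher-generated prefixes, while lower values expose the student to its own prefixes with teacher supervision, which is the general idea behind DAgger-style training.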

Paper Structure

This paper contains 10 sections, 1 theorem, 11 equations.

Key Result

Theorem 1

Let $\hat{\pi}$ be a policy such that $L_{BC}(\hat{\pi}) \leq \epsilon$. Then, $J(\hat{\pi}) \leq J(\pi^*) + T^2 \epsilon$.
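To get a feel for the quadratic term in Theorem 1, the short sketch below evaluates the bound $T^2 \epsilon$ for a few values. It assumes, as is standard in imitation learning analyses but not stated in this excerpt, that $T$ is the horizon (here, the sequence length) and that $J$ is a cumulative cost. The function name `bc_gap_bound` is hypothetical.

```python
def bc_gap_bound(horizon: int, eps: float) -> float:
    """Upper bound T^2 * eps on J(pi_hat) - J(pi_star) from Theorem 1."""
    if horizon < 0 or eps < 0:
        raise ValueError("horizon and eps must be non-negative")
    return horizon ** 2 * eps


if __name__ == "__main__":
    # With a fixed per-step error, a 10x longer horizon loosens the bound 100x.
    for horizon in (10, 50, 100):
        print(horizon, bc_gap_bound(horizon, eps=0.001))
```

Because the bound grows with the square of the horizon, a small behavioral cloning loss can still permit a large gap on long sequences, which is one way to read the motivation for addressing exposure bias.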
