Optimal Completion Distillation for Sequence Learning
Sara Sabour, William Chan, Mohammad Norouzi
TL;DR
Optimal Completion Distillation (OCD) trains seq2seq models to optimize edit distance directly. For each prefix the model generates, a dynamic programming pass identifies the optimal suffixes exactly and yields exact Q-values for every candidate next token. OCD builds an optimal next-token policy from these Q-values and distills it into the model with a KL loss, with no MLE pretraining and no joint likelihood objective. On WSJ and Librispeech, OCD achieves state-of-the-art end-to-end speech recognition results without language-model rescoring. The method has no hyperparameters of its own and accommodates on-policy or off-policy trajectories.
Abstract
We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence-to-sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pretraining or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution that puts equal probability on the first token of all the optimal suffixes. OCD achieves state-of-the-art performance on end-to-end speech recognition on both the Wall Street Journal and Librispeech datasets, with $9.3\%$ WER and $4.5\%$ WER, respectively.
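The target construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `optimal_next_tokens` and `ocd_target` and the end-of-sequence marker `"</s>"` are assumptions made for illustration. The sketch keeps one row of the standard edit-distance table, where entry `j` is the edit distance between the generated prefix and the first `j` reference tokens. Every position `j` that attains the row minimum marks a point from which the rest of the reference is an optimal suffix, so the reference token at that position (or the end marker, if the whole reference is already covered) is an optimal next token.

```python
def optimal_next_tokens(prefix, target, eos="</s>"):
    """Return the set of first tokens of all optimal suffixes for `prefix`.

    `eos` is a hypothetical end-of-sequence marker chosen for this sketch.
    """
    # row[j] = edit distance between the current prefix and target[:j].
    # For the empty prefix this is j (j insertions).
    row = list(range(len(target) + 1))
    for tok in prefix:
        new_row = [row[0] + 1]  # prefix token with nothing to align to
        for j in range(1, len(target) + 1):
            substitute = row[j - 1] + (tok != target[j - 1])
            drop_prefix_tok = row[j] + 1          # extra token in the prefix
            skip_target_tok = new_row[j - 1] + 1  # missing reference token
            new_row.append(min(substitute, drop_prefix_tok, skip_target_tok))
        row = new_row

    best = min(row)
    optimal = set()
    for j, dist in enumerate(row):
        if dist == best:
            # Continuing with target[j:] keeps the total edit distance at `best`.
            optimal.add(target[j] if j < len(target) else eos)
    return optimal


def ocd_target(prefix, target, eos="</s>"):
    """Uniform target distribution over the optimal next tokens."""
    optimal = optimal_next_tokens(prefix, target, eos)
    return {tok: 1.0 / len(optimal) for tok in optimal}


if __name__ == "__main__":
    # Character-level example: reference "SUNDAY", generated prefix "SA".
    print(ocd_target("SA", "SUNDAY"))
```

For the reference "SUNDAY" and the generated prefix "SA", the final row is `[2, 1, 1, 2, 3, 3, 4]`, so both "U" and "N" start an optimal suffix and each receives probability 0.5. Recomputing the row from scratch for every prefix is done here for clarity only; in training, the row for each prefix would be extended incrementally from the previous one.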
