Improving Autoregressive Training with Dynamic Oracles

Jianing Yang; Harshine Visvanathan; Yilin Wang; Xinyi Hu; Matthew Gormley

TL;DR

The paper addresses exposure bias and metric misalignment in autoregressive sequence training by integrating DAgger with metric-specific dynamic oracles. It develops exact dynamic oracles for decomposable metrics like partial and exact F1, and approximate dynamic oracles for non-decomposable metrics such as ROUGE and BLEU using beam search, preserving no-regret guarantees for decomposable cases. Empirically, DAgger with these dynamic oracles improves partial F1 on NER and ROUGE on summarization, while MT (BLEU) results are mixed, sometimes not exceeding strong baselines. The work provides a practical, metric-aware training paradigm that can be extended to other metrics and model families, with runtime considerations and future directions for stochastic oracles and broader metric support.

Abstract

Many tasks within NLP can be framed as sequential decision problems, ranging from sequence tagging to text generation. However, for many tasks, the standard training methods, including maximum likelihood (teacher forcing) and scheduled sampling, suffer from exposure bias and a mismatch between the metrics used during training and inference. DAgger provides a solution to mitigate these problems, yet it requires a metric-specific dynamic oracle algorithm, which does not exist for many common metrics such as span-based F1, ROUGE, and BLEU. In this paper, we develop these novel dynamic oracles and show they maintain DAgger's no-regret guarantee for decomposable metrics like span-based F1. We evaluate the algorithm's performance on named entity recognition (NER), text summarization, and machine translation (MT). While DAgger with a dynamic oracle yields less favorable results in our MT experiments, it outperforms the baseline techniques in NER and text summarization.

Paper Structure (43 sections, 2 equations, 3 figures, 3 tables, 3 algorithms)

Introduction
Methods
DAgger
Dynamic Oracles
Exact and Partial F1
ROUGE and BLEU
Experiments
F1 / Named Entity Recognition
Dataset
Experiment Details
Results
BLEU / Machine Translation
Dataset
Experiment Details
Results
...and 28 more sections

Figures (3)

Figure 1: (a) Illustration of the issues encountered in standard sequence training and their consequences. (b) Problems faced by Teacher Forcing, Scheduled Sampling, and DAgger.
Figure 2: A dynamic oracle produces better supervision than vanilla scheduled sampling when training an autoregressive decoder. Scheduled sampling flips a coin at each autoregressive decoding step to decide whether a ground-truth token or a model-predicted token is used in the prefix. In this example, the decoder did not predict the word "the", so "the" is missing from the input prefix. What supervision should be used for the red box? Vanilla scheduled sampling uses the supervision "so we should submit conference paper by by midnight" (note that "by" appears twice), which leads to a BLEU of 36.9, whereas the dynamic oracle uses the supervision "so we should submit conference paper by midnight", which leads to a BLEU of 37.7. The dynamic oracle therefore gives better supervision.
Figure 3: Larger beam sizes and starting DAgger training earlier both result in better dynamic oracle quality. (a) The vertical axis shows the percentage of instances in which the dynamic oracle's completion yields a higher BLEU score than the ground truth. A higher percentage implies a greater benefit from switching from teacher forcing to the dynamic oracle. By design, the dynamic oracle's BLEU cannot be lower than that of the ground truth, because it always selects the better word between the ground truth and the beam search result. (b) Comparison of average BLEU scores for the ground-truth and dynamic oracle supervisions, averaged over an entire batch. Beam sizes of 5, 20, 90, and 1000 are used for the dynamic oracle.
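
The selection rule described in the captions above (keep the ground-truth continuation unless a beam-search completion scores higher) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `unigram_f1` is a toy stand-in for BLEU, `pick_supervision` is a hypothetical helper, and the beam search is replaced by a hand-written list of candidate continuations. The sentences follow the Figure 2 example.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Toy stand-in metric (assumption): clipped unigram F1, not BLEU or ROUGE."""
    if not candidate or not reference:
        return 0.0
    # Counter intersection clips each token count to its count in the reference.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def pick_supervision(prefix, gold_continuation, candidates, reference,
                     metric=unigram_f1):
    """Hypothetical helper: return the best-scoring continuation and its score.

    The ground-truth continuation is the default, and a candidate replaces it
    only when it scores strictly higher, so the result never scores below the
    ground truth under the chosen metric.
    """
    best = gold_continuation
    best_score = metric(prefix + gold_continuation, reference)
    for cand in candidates:
        score = metric(prefix + cand, reference)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


reference = "so we should submit the conference paper by midnight".split()
# Prefix produced during training; the model skipped the word "the".
prefix = "so we should submit conference paper by".split()
# Position-aligned ground truth would continue with "by midnight".
gold_continuation = reference[len(prefix):]
# Hand-written stand-ins for beam-search completions (assumption).
beam_candidates = [["midnight"], ["by", "noon"]]

chosen, score = pick_supervision(prefix, gold_continuation,
                                 beam_candidates, reference)
print(chosen, round(score, 3))
```

Under this toy metric the position-aligned ground truth repeats "by" and scores lower than the candidate that continues directly with "midnight", which mirrors the qualitative point of Figure 2. The exact scores differ from the BLEU values in the caption because the metric is a simplification.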
