Calibration of Encoder Decoder Models for Neural Machine Translation

Aviral Kumar, Sunita Sarawagi

TL;DR

This paper examines calibration in attention-based NMT systems. It finds that token probabilities are substantially miscalibrated, even under teacher forcing, and identifies EOS miscalibration and attention uncertainty as the primary causes. It introduces a weighted calibration error (weighted ECE) to assess the full output distribution and develops a context-dependent recalibration model that accounts for attention entropy, input coverage, and token logits. The method reduces calibration error, improves BLEU scores, and stabilizes BLEU as beam size increases, giving more reliable beam-search predictions. These results highlight the practical importance of calibration for interpretable confidence and robust decoding in neural machine translation, and offer a principled alternative to heuristic fixes such as temperature scaling.

Abstract

We study the calibration of several state-of-the-art neural machine translation (NMT) systems built on attention-based encoder-decoder models. For structured outputs such as those in NMT, calibration is important not just for reliable prediction confidence, but also for the proper functioning of beam-search inference. We show that most modern NMT models are surprisingly miscalibrated even when conditioned on the true previous tokens. Our investigation points to two main causes: severe miscalibration of EOS (the end-of-sequence marker) and suppression of attention uncertainty. We design recalibration methods based on these signals and demonstrate improved accuracy, better sequence-level calibration, and more intuitive results from beam search.

Paper Structure

This paper contains 21 sections, 14 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Reliability plots for various baseline models on the test sets, with their ECE values (blue). The x-axis is the expected confidence after binning into bins of width 0.05, and the y-axis is the accuracy in that confidence bin. Reliability plots for the calibrated (corrected) models are shown in red, with ECE values in the corresponding colors. Test sets are mentioned in the corresponding references.
  • Figure 2: Token-wise calibration plots for some of the models. Note the miscalibration of EOS versus the calibration of other tokens. All other tokens show roughly the same trend as the overall calibration plot.
  • Figure 3: Tail and head calibration plots for three models. Note that the head is overestimated in GNMT/NMT and underestimated in T2T, while the tail shows the opposite trend. Here the x-axis corresponds to the log of the fraction of the vocabulary that is classified as tail prediction.
  • Figure 4: Sequence-level calibration plots for various models [Baseline + Corrected (Calibrated)]. The dotted lines show the densities (fraction of all points) in each bin. Note that the density in all cases shifts to the low end, showing that overestimation is reduced. This trend in the density is the same across all models, and the calibrated densities agree better with the observed BLEU on the test datasets.
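
The reliability plots in Figure 1 bin predictions by confidence (bins of width 0.05) and compare the average confidence in each bin with the accuracy in that bin. As a rough illustration, the sketch below computes the standard expected calibration error (ECE) for top-1 predictions under that binning. It is not the paper's code and does not implement the weighted ECE variant; the function name `expected_calibration_error` and its inputs (a list of predicted-token confidences and a parallel list of correctness flags) are assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, bin_width=0.05):
    """Standard top-1 ECE with equal-width confidence bins (illustrative sketch).

    confidences: probability the model assigned to its predicted token, in [0, 1]
    correct:     1/True where the predicted token matched the reference, else 0/False
    Returns (ece, points), where points holds (avg_confidence, accuracy) per
    non-empty bin, i.e. the coordinates of a reliability plot.
    """
    n_bins = int(round(1.0 / bin_width))
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(hit)))

    total = sum(len(members) for members in bins)
    ece = 0.0
    points = []
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Each bin's |accuracy - confidence| gap is weighted by its share of predictions.
        ece += (len(members) / total) * abs(accuracy - avg_conf)
        points.append((avg_conf, accuracy))
    return ece, points


# Four predictions made at 0.9 confidence, three of them correct:
# the single occupied bin has a gap of |0.75 - 0.9|.
ece, points = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A perfectly calibrated model would place every point on the diagonal of the reliability plot and have an ECE of zero; points below the diagonal indicate overestimated confidence.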