Calibration of Encoder Decoder Models for Neural Machine Translation
Aviral Kumar, Sunita Sarawagi
TL;DR
This paper examines calibration in attention-based NMT systems, revealing substantial miscalibration of token probabilities, even under teacher forcing, and identifies EOS miscalibration and attention uncertainty as primary causes. It introduces a weighted expected calibration error (weighted ECE) to assess the full output distribution and develops a context-dependent recalibration model that accounts for attention entropy, input coverage, and token logits. The method yields significant reductions in calibration error, improves BLEU scores, and stabilizes BLEU across increasing beam sizes, providing more reliable beam-search predictions. These results highlight the practical importance of calibration for interpretable confidence and robust decoding in neural machine translation and offer a principled alternative to heuristic fixes like temperature scaling.
Abstract
We study the calibration of several state-of-the-art neural machine translation (NMT) systems built on attention-based encoder-decoder models. For structured outputs such as those in NMT, calibration matters not only for reliable prediction confidence but also for the proper functioning of beam-search inference. We show that most modern NMT models are surprisingly miscalibrated even when conditioned on the true previous tokens. Our investigation identifies two main causes: severe miscalibration of EOS (the end-of-sequence marker) and suppression of attention uncertainty. We design recalibration methods based on these signals and demonstrate improved accuracy, better sequence-level calibration, and more intuitive results from beam search.
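
To make the notion of token-level miscalibration concrete, the following is a minimal sketch of the standard binned expected calibration error (ECE) computed over top-1 token predictions. It is not the weighted variant used in this paper, and the function name `expected_calibration_error`, its parameters, and the choice of ten equal-width bins are assumptions made for illustration only.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Standard binned ECE (illustrative; not the paper's weighted ECE).

    confidences: predicted probability of the top-1 token at each position.
    correct: whether that top-1 token matched the reference token.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * num_bins  # sum of confidences per bin
    hit_sum = [0.0] * num_bins   # number of correct predictions per bin
    count = [0] * num_bins       # number of predictions per bin

    for p, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; p == 1.0 goes into the last bin.
        b = min(int(p * num_bins), num_bins - 1)
        conf_sum[b] += p
        hit_sum[b] += 1.0 if is_correct else 0.0
        count[b] += 1

    ece = 0.0
    for b in range(num_bins):
        if count[b] == 0:
            continue
        accuracy = hit_sum[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        # Weight each bin's |accuracy - confidence| gap by its share of data.
        ece += (count[b] / total) * abs(accuracy - mean_conf)
    return ece
```

A perfectly calibrated model would score 0 under this measure; a model that assigns 0.95 confidence to predictions that are right only 75% of the time contributes a gap of about 0.2 for that bin.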
