Uncertainty-Aware Decoding with Minimum Bayes Risk

Nico Daheim, Clara Meister, Thomas Möllenhoff, Iryna Gurevych

TL;DR

This work extends Minimum Bayes Risk (MBR) decoding to account for weight uncertainty in language generation by introducing a predictive posterior that marginalizes over a parameter distribution $q(\boldsymbol{\theta})$. Using sequence- and token-level posteriors with practical Monte Carlo estimators, the approach supports uncertainty-aware decoding, selective prediction, and ensembling of diverse models, including black-box LLMs. Experiments on translation, summarization, and data-to-text tasks show improved quality and fewer hallucinations; the gains correlate with the diversity of the ensemble and scale to larger hypothesis sets. Uncertainty-aware MBR can thus improve generation, in some settings with near-zero overhead, and offers a principled framework for deciding when to generate and when to abstain.

Abstract

Despite their outstanding performance in the majority of scenarios, contemporary language models still occasionally generate undesirable outputs, for example, hallucinated text. While such behaviors have previously been linked to uncertainty, there is a notable lack of methods that actively consider uncertainty during text generation. In this work, we show how Minimum Bayes Risk (MBR) decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method. In short, we account for model uncertainty during decoding by incorporating a posterior over model parameters into MBR's computation of expected risk. We show that this modified expected risk is useful for both choosing outputs and deciding when to abstain from generation and can provide improvements without incurring overhead. We benchmark different methods for learning posteriors and show that performance improves with prediction diversity. We release our code publicly.
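The idea in the abstract can be illustrated with a minimal sketch of sequence-level MBR over an ensemble. This is not the authors' released code: the function names, the toy unigram-F1 utility, and the abstention threshold are all assumptions made for illustration. Each inner list stands for outputs sampled from one model $\boldsymbol{\theta}_m \sim q(\boldsymbol{\theta})$; the expected utility of a candidate is averaged first within each model and then across models, and the best score doubles as a signal for abstaining.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: F1 over sets of whitespace tokens (stand-in for a real metric)."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def uncertainty_aware_mbr(
    samples_per_model: Sequence[Sequence[str]],
    utility: Callable[[str, str], float] = unigram_f1,
) -> Tuple[str, float]:
    """Pick the candidate with the highest Monte Carlo expected utility.

    samples_per_model[m] holds outputs sampled from the m-th model (assumed
    non-empty). Samples serve both as candidates and as pseudo-references.
    """
    pooled: List[str] = [y for samples in samples_per_model for y in samples]
    candidates = list(dict.fromkeys(pooled))  # de-duplicate, keep order

    def expected_utility(candidate: str) -> float:
        # Average over samples within each model, then over models.
        per_model = [
            sum(utility(candidate, y) for y in samples) / len(samples)
            for samples in samples_per_model
        ]
        return sum(per_model) / len(per_model)

    scores: Dict[str, float] = {c: expected_utility(c) for c in candidates}
    best = max(candidates, key=lambda c: scores[c])
    return best, scores[best]


def should_abstain(best_utility: float, threshold: float) -> bool:
    """Hypothetical selective-prediction rule: abstain when expected utility is low."""
    return best_utility < threshold


# Two hypothetical ensemble members, two samples each.
samples = [
    ["the cat sat", "the cat sat down"],
    ["a cat sat", "the cat sat"],
]
best, score = uncertainty_aware_mbr(samples)
print(best, round(score, 3), should_abstain(score, threshold=0.5))
```

With an equal number of samples per model, the two-stage average equals a single average over the pooled samples, which is why ensembling at the sequence level can reuse an existing MBR pipeline. The threshold of 0.5 is arbitrary and would need tuning on held-out data.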

Paper Structure

This paper contains 49 sections, 13 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Our methods are more successful when the ensembled models are diverse. We compare a unimodal posterior to mixture-based posteriors built with Snapshot Ensembles and Deep Ensembles. Sampling from a unimodal posterior with higher temperature can increase diversity and improve performance (in blue). Left: token-level combination on IWSLT14 using beam search and Transformer$_\text{base}$. Right: sequence-level combination (\ref{eq:seq_level_mbr_estimator}) on IWSLT17 using ancestral sampling and Gemma-2B.
  • Figure 2: Total risk and best-output risk are useful for selective prediction. (a) Creating hypothesis sets with sampling performs better than with beam search. (b) Increasing the temperature when sampling from unimodal posteriors improves selective prediction. (c) When using beam search, Deep Ensembles work best. (d) For sampling, all methods work well. Results on IWSLT14 with Transformer$_\text{base}$.
  • Figure 3: Scaling behavior on IWSLT14 with Transformer$_\text{base}$ in terms of ensemble size (a, b) and hypothesis set size (c, d). (a, b) For a unimodal posterior ($\square$), larger ensembles improve token-level combination using sampling but not beam search. For Deep Ensemble posteriors ($\circ$), larger ensembles generally improve performance. (c, d) Sequence-level combination (\ref{eq:seq_level_mbr_estimator}) performs better for smaller beam sizes but is outperformed by token-level combination at larger ones. Scaling the hypothesis set produces stronger improvements for ancestral sampling than for beam search.