Non-Exchangeable Conformal Language Generation with Nearest Neighbors

Dennis Ulmer; Chrysoula Zerva; André F. T. Martins

TL;DR

This work addresses uncertainty in non-i.i.d. neural language generation by adapting conformal prediction to a non-exchangeable setting and pairing it with nearest-neighbor retrieval. The proposed non-exchangeable conformal nucleus sampling yields token-level, calibrated prediction sets post-hoc, without additional training, by building a $k$-NN datastore of decoder states and conformity scores and shaping the set via a learned temperature. Across machine translation and language modeling, it achieves coverage close to the target with tighter prediction sets than baselines, and demonstrates robustness under distributional drift, aided by adaptive prediction sets. The approach offers a principled, scalable way to constrain generation with statistical guarantees and flexible uncertainty control, supported by open-source code and extensive empirical analysis.

Abstract

Quantifying uncertainty in automatically generated text is important for letting humans check potential hallucinations and making systems more reliable. Conformal prediction is an attractive framework for providing predictions with statistical guarantees; however, applying it to text generation is challenging because i.i.d. assumptions are unrealistic. In this paper, we bridge this gap by leveraging recent results on non-exchangeable conformal prediction, which still ensures bounds on coverage. The result, non-exchangeable conformal nucleus sampling, is a novel extension of the conformal prediction framework to generation based on nearest neighbors. Our method can be used post-hoc for an arbitrary model without extra training and supplies token-level, calibrated prediction sets equipped with statistical guarantees. Experiments in machine translation and language modeling show encouraging results in generation quality. By also producing tighter prediction sets with good coverage, we offer a more theoretically principled way to perform sampling with conformal guarantees.
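The per-token procedure summarized above (retrieve neighbors of the decoder state, weight their non-conformity scores by distance, take a weighted quantile, and cut the sorted vocabulary at that quantile) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the brute-force NumPy search, the squared-distance exponential weighting, and the fixed `tau` (the paper describes a learned temperature) are all hypothetical choices made for this example.

```python
import numpy as np

def aps_score(probs, true_token):
    """Adaptive-prediction-set style score: probability mass of all tokens
    ranked at or above the true token."""
    order = np.argsort(-probs)               # tokens by descending probability
    cum = np.cumsum(probs[order])
    rank = np.where(order == true_token)[0][0]
    return cum[rank]

def weighted_quantile(scores, weights, alpha):
    """Smallest score whose normalized weight mass reaches 1 - alpha."""
    # One extra unit of mass is reserved for the test point itself
    # (assumption: the usual non-exchangeable normalization).
    norm = weights / (1.0 + weights.sum())
    order = np.argsort(scores)
    cum = np.cumsum(norm[order])
    idx = np.searchsorted(cum, 1.0 - alpha)  # first index with cum >= 1 - alpha
    if idx >= len(scores):
        return np.inf                        # not enough mass: keep everything
    return scores[order][idx]

def prediction_set(probs, q_hat):
    """Tokens in descending probability until cumulative mass reaches q_hat."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    size = min(int(np.searchsorted(cum, q_hat)) + 1, len(probs))
    return order[:size]

def conformal_set(z_t, probs, store_keys, store_scores, k=2, tau=1.0, alpha=0.1):
    """Hypothetical per-step routine: k-NN lookup, weighting, quantile, set."""
    d2 = ((store_keys - z_t) ** 2).sum(axis=1)   # squared L2 distances
    nn = np.argsort(d2)[:k]                      # brute-force nearest neighbors
    w = np.exp(-d2[nn] / tau)                    # closer neighbors weigh more
    q_hat = weighted_quantile(store_scores[nn], w, alpha)
    return prediction_set(probs, q_hat), q_hat

# Toy datastore of (decoder state, stored score) pairs; values are made up.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
scores = np.array([0.6, 0.7, 0.99])
probs = np.array([0.5, 0.3, 0.15, 0.05])
tokens, q_hat = conformal_set(np.array([0.0, 0.0]), probs, keys, scores, alpha=0.5)
```

In a realistic setting the datastore would be filled with `aps_score` values computed on held-out data and queried with an approximate nearest-neighbor index; sampling would then be restricted to the returned token set.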

Paper Structure (32 sections, 10 equations, 4 figures, 10 tables, 1 algorithm)

Introduction
Contributions.
Related Work
Conformal Prediction.
Uncertainty in NLP.
Background
Conformal Prediction.
Non-exchangeable Conformal Prediction.
Method: Non-exchangeable Conformal Language Generation through Nearest Neighbors
Adaptive Prediction Sets.
Experiments
Evaluating Coverage
Evaluation.
Results.
Coverage Under Shift
...and 17 more sections

Figures (4)

Figure 1: Schematic representation of our approach. A decoder hidden representation $\mathbf{z}_t$ is used during inference to retrieve the nearest neighbors and their non-conformity scores $s_k$. Their relevance is determined by using their distance to compute weights $w_k$, resulting in the quantile $\hat{q}$ that forms conformal prediction sets.
Figure 2: Conditional coverage for M2M100 on de $\rightarrow$ en with the small 418M model (three panels) and with the larger 1.2B model (one panel). We aggregate predictions by set size using $75$ equally spaced bins in total. The blue curve shows the conditional coverage per bin, whereas red bars show the number of binned predictions.
Figure 3: Coverage, average set size, and $\hat{q}$ as a function of the noise level on the de $\rightarrow$ en MT task (top) and the open text generation task (bottom). Error bars show one standard deviation.
Figure 4: Additional conditional coverage plots for the MT and LM datasets using our non-exchangeable conformal prediction method, aggregating predictions by prediction set size. The blue curve shows the conditional coverage per bin, whereas red bars show the number of predictions per bin. For the two OPT panels, we zoom in on prediction set sizes from $1$ to $100$.
