Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverage Guarantees

Ye Li, Anqi Hu, Yuanchang Ye, Shiyan Tong, Zhiyuan Wang, Bo Fu

Abstract

Large language models (LLMs) inherently operate over a large generation space, yet conventional usage typically reports the most likely generation (MLG) as a point prediction, which underestimates the model's capability: although the top-ranked response can be incorrect, valid answers may still exist within the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction, which provides feasibility-aware coverage guarantees. We show that, given the finite-sampling nature of LLM generation, coverage is not always achievable: even with multiple samplings, LLMs may fail to yield an acceptable response for certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we then develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a rigorous threshold, ensuring that the resulting set contains a correct answer with a desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.
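The feasibility idea above can be illustrated with a minimal sketch. It assumes each calibration question already comes with binary acceptability flags for its $K$ sampled responses. The function names and the `(misses + 1) / (n + 1)` finite-sample correction are illustrative assumptions in the style of conformal methods, not the paper's exact definition of the MRL.

```python
from typing import Sequence


def empirical_min_risk(acceptable: Sequence[Sequence[bool]]) -> float:
    """Conservative estimate of the lowest risk level that sampling can support.

    acceptable[i][k] is True when the k-th sampled response to calibration
    question i is judged acceptable. A question is a "miss" when none of its
    sampled responses is acceptable, so no prediction set built from those
    samples can cover it.
    ASSUMPTION: the +1 terms are a conformal-style finite-sample correction.
    """
    n = len(acceptable)
    misses = sum(1 for flags in acceptable if not any(flags))
    return (misses + 1) / (n + 1)


def is_feasible(alpha: float, acceptable: Sequence[Sequence[bool]]) -> bool:
    """A target risk level is treated as feasible only at or above the estimate."""
    return alpha >= empirical_min_risk(acceptable)


# Four calibration questions, K = 3 samples each; the third has no acceptable sample.
flags = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [True, True, False],
]
print(empirical_min_risk(flags))  # (1 + 1) / (4 + 1) = 0.4
print(is_feasible(0.1, flags))    # False: the target lies below the estimate
```

In this toy setting, a target risk of 0.1 is rejected as infeasible because one of the four questions has no acceptable response among its samples, whatever selection rule is applied afterwards.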

Paper Structure (27 sections, 1 theorem, 21 equations, 10 figures)

Key Result

Theorem 3.1

Assume that the augmented tuples for $n$ calibration samples and the test data, $\{ (x_i, y_i^*, \{\hat{y}_k^{(i)}\}_{k=1}^{K}) \}_{i=1}^{n+1}$, are exchangeable. Let $\hat{\lambda}$ be defined as above. Then, for any target risk level $\alpha \ge \alpha_l$, the calibrated prediction set $\mathcal{C}$ [...]
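The definition of $\hat{\lambda}$ is not reproduced in this excerpt, so the following is only a generic conformal-risk-control-style sketch of how a threshold of this kind could be calibrated. It assumes a hypothetical per-response confidence score (higher means more reliable), a prediction set that keeps every sampled response scoring at least $\lambda$, and a miss budget derived from `(misses + 1) / (n + 1) <= alpha`. None of these choices is confirmed as the paper's procedure.

```python
import math
from typing import List, Optional, Sequence, Tuple

# (score, is_acceptable) for one sampled response; the score is a hypothetical
# confidence measure, not one specified by the source.
Candidate = Tuple[float, bool]


def calibrate_threshold(
    calibration: Sequence[Sequence[Candidate]], alpha: float
) -> Optional[float]:
    """Return the largest threshold whose corrected miss rate stays within alpha.

    Returns None when alpha is infeasible, i.e. too many calibration questions
    have no acceptable sampled response at all.
    """
    n = len(calibration)
    # Per question: the highest score among acceptable responses. The set
    # {score >= lam} covers the question exactly when lam <= this value.
    crit = sorted(
        max((score for score, ok in cands if ok), default=-math.inf)
        for cands in calibration
    )
    # Largest miss count m satisfying (m + 1) / (n + 1) <= alpha.
    budget = math.floor(alpha * (n + 1)) - 1
    if budget < 0:
        return None
    if budget >= n:
        return math.inf  # every question may be missed; the empty set suffices
    if crit[budget] == -math.inf:
        return None  # more uncoverable questions than the budget allows
    # At lam = crit[budget], at most `budget` questions have crit < lam.
    return crit[budget]


def predict_set(scored: Sequence[Tuple[str, float]], lam: float) -> List[str]:
    """Keep the sampled responses whose score reaches the calibrated threshold."""
    return [response for response, score in scored if score >= lam]
```

With five calibration questions whose critical values are $-\infty$, 0.5, 0.6, 0.85, and 0.9, a target of $\alpha = 0.5$ allows two misses and yields the threshold 0.6, whereas $\alpha = 0.2$ allows none and is reported as infeasible because one question has no acceptable sample.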

Figures (10)

  • Figure 1: Overview of our feasibility-aware calibration framework with statistically rigorous coverage guarantees.
  • Figure 2: (a-c) Comparison between point-prediction accuracy and set-valued attainability under different semantic matching thresholds. The consistent gap shows that point prediction systematically under-utilizes admissible answers already present in the sampled candidate space; (d) shows that this gap widens under stricter semantic evaluation.
  • Figure 3: Coverage Guarantees on six NLG benchmarks utilizing five LLMs. The threshold of sentence similarity is fixed at 0.7.
  • Figure 4: Coverage Guarantees on six NLG benchmarks utilizing five LLMs. The threshold of sentence similarity is fixed at 0.5.
  • Figure 5: Coverage Guarantees on six NLG benchmarks utilizing five LLMs. The threshold of sentence similarity is fixed at 0.6.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Theorem 3.1: Coverage Guaranteed Threshold