SAFER: Risk-Constrained Sample-then-Filter in Large Language Models

Qingni Wang, Yue Fan, Xin Eric Wang

TL;DR

SAFER addresses the challenge of risk-controlled, open-ended QA with large language models by introducing a two-stage framework that first abstains when the desired risk bound is unattainable and then applies conformalized filtering to prune unreliable answers. In the first stage, it calibrates a minimum test-time sampling budget $\hat{s}$ on a held-out calibration set using the Clopper–Pearson upper bound at risk level $\alpha$ and abstains if the bound cannot be met, so that the sampling set has guaranteed coverage of admissible answers. In the second stage, SAFER calibrates a threshold $\hat{t}$ via conformal risk control at risk level $\beta$ and filters candidates by their uncertainty $U(\hat{y})$, keeping the overall miscoverage within the joint bound $\alpha+\beta-\alpha\beta$. Experiments on TriviaQA, CoQA, and ScienceQA with multiple open-source LLMs demonstrate two-stage risk control, with statistically valid miscoverage bounds, smaller final prediction sets, and robustness to different correctness definitions. This yields a practical, data-efficient, model-agnostic method for trustworthy, uncertainty-aware decision-making in foundation models for real-world QA tasks.
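The joint bound $\alpha+\beta-\alpha\beta$ can be motivated by a short worked derivation. This is a sketch under an assumption not spelled out in the summary above: that the Stage II guarantee (miss rate at most $\beta$) holds conditionally on Stage I having succeeded. The event names $A$ and $B$ are introduced here for illustration only. The inequality step uses the fact that $a + (1-a)\,b$ is non-decreasing in both $a$ and $b$ on $[0,1]$.

```latex
% A: the \hat{s} sampled candidates contain no admissible answer (Stage I miss)
% B: filtering at \hat{t} removes every admissible answer (Stage II miss)
\begin{align*}
\Pr(\text{miss}) &= \Pr(A) + \Pr(A^{c})\,\Pr(B \mid A^{c}) \\
                 &\le \alpha + (1-\alpha)\,\beta \\
                 &= \alpha + \beta - \alpha\beta .
\end{align*}
% Example: \alpha = \beta = 0.1 gives 0.1 + 0.1 - 0.01 = 0.19.
```

Equivalently, the joint coverage is at least $(1-\alpha)(1-\beta)$.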

Abstract

As large language models (LLMs) are increasingly deployed in risk-sensitive applications such as real-world open-ended question answering (QA), ensuring the trustworthiness of their outputs has become critical. Existing selective conformal prediction (SCP) methods provide statistical guarantees by constructing prediction sets with a constrained miscoverage rate for correct answers. However, prior work unrealistically assumes that admissible answers for all instances can be obtained via finite sampling, even for open-ended QA scenarios that lack a fixed and finite solution space. To address this, we introduce a two-stage risk control framework comprising abstention-aware sampling and conformalized filtering (SAFER). First, on a held-out calibration set, SAFER calibrates a sampling budget within the maximum sampling cap, using the Clopper-Pearson exact method at a user-desired risk level (i.e., the maximum allowable miscoverage rate of the sampling sets). If the risk level cannot be satisfied within the cap, we abstain; otherwise, the calibrated sampling budget becomes the minimum requirement at test time. Then, we employ calibration instances where correct answers are attainable under the calibrated budget and apply the conformal risk control method to determine a statistically valid uncertainty threshold, which filters unreliable distractors from the candidate set for each test data point. In this stage, SAFER introduces an additional risk level to guide the calculation of the threshold, thereby controlling the risk of correct answers being excluded. Furthermore, we show that SAFER is compatible with various task-specific admission criteria and calibration-test split ratios, highlighting its robustness and high data efficiency.
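The abstention-aware sampling stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `first_hit` input encoding, and the confidence parameter `delta` are assumptions made here for illustration (the abstract names only the Clopper-Pearson exact method, a risk level, and a sampling cap). The sketch computes a one-sided Clopper-Pearson upper bound on the calibration miscoverage rate for each budget and returns the smallest budget whose bound is within the risk level, or `None` to abstain.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(failures, n, delta):
    """One-sided (1 - delta) Clopper-Pearson upper bound on a binomial rate.

    Solves binom_cdf(failures, n, p) = delta for p by bisection; the CDF is
    decreasing in p, so the root is unique when failures < n.
    """
    if failures >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(failures, n, mid) > delta:
            lo = mid  # bound must be larger
        else:
            hi = mid
    return hi  # slightly conservative end of the bracket


def calibrate_budget(first_hit, max_budget, alpha, delta):
    """Return the smallest budget s whose upper bound is <= alpha, else None.

    first_hit[i] (hypothetical encoding): number of samples needed before
    calibration question i yields an admissible answer, or None if no
    admissible answer appears within max_budget samples.
    """
    n = len(first_hit)
    for s in range(1, max_budget + 1):
        failures = sum(1 for h in first_hit if h is None or h > s)
        if clopper_pearson_upper(failures, n, delta) <= alpha:
            return s
    return None  # abstain: risk level unattainable within the cap


# Toy calibration set of 100 questions (made-up numbers).
first_hit = [1] * 50 + [2] * 30 + [3] * 15 + [None] * 5
print(calibrate_budget(first_hit, max_budget=5, alpha=0.2, delta=0.05))   # 3
print(calibrate_budget(first_hit, max_budget=5, alpha=0.01, delta=0.05))  # None
```

In the toy data, 5 of 100 questions never yield an admissible answer, so a 1% risk level cannot be certified at any budget and the sketch abstains.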

Paper Structure

This paper contains 35 sections, 31 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Overview of SAFER's calibration and test-time process. In Stage I, we derive a statistically valid minimum sampling budget $\hat{s}$ that strictly controls the test-time risk that a candidate set of size $\hat{s}$ fails to cover a correct answer. In Stage II, we use the calibration instances for which an admissible answer is obtained within $\hat{s}$ samples to calibrate a threshold $\hat{t}$. This threshold filters out unreliable answers in the candidate set while still constraining the miscoverage risk of the final prediction set.
  • Figure 2: Empirical miscoverage rates under different sampling budgets. The dashed lines denote the system miscoverage upper bounds derived via the Clopper–Pearson exact method on the calibration set, while the solid lines show the corresponding empirical miscoverage rates on the test set.
  • Figure 3: Comparison of TRON and SAFER on the control of test-time EER in the sampling stage.
  • Figure 4: Comparison of the (average) calibrated sampling budget in the sampling stage with the (average) prediction set size after filtering out unreliable answers in the second stage. Filtering compresses the calibrated budgets into tighter sets while maintaining risk control.
  • Figure 5: Test-time EER results in the sampling stage with Rouge-L score as the correctness metric.
  • ...and 10 more figures
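
The Stage II filtering described in the Figure 1 caption can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: it assumes a 0/1 loss (the filtered set contains no admissible answer), uses the standard conformal risk control rule for losses bounded by 1, and treats the uncertainty scores as given. The function names and the `min_correct_uncertainty` input are hypothetical.

```python
def calibrate_threshold(min_correct_uncertainty, beta):
    """Smallest threshold t with (misses(t) + 1) / (n + 1) <= beta, else None.

    min_correct_uncertainty[i] (hypothetical encoding): lowest uncertainty
    score among the admissible answers of calibration instance i. Only
    instances with an admissible answer inside the calibrated budget are
    included. Instance i is missed at threshold t when that score exceeds t.
    """
    n = len(min_correct_uncertainty)
    # The empirical risk is a step function that only changes at these values.
    for t in sorted(min_correct_uncertainty):
        misses = sum(1 for m in min_correct_uncertainty if m > t)
        if (misses + 1) / (n + 1) <= beta:
            return t
    return None  # too few calibration instances for this beta


def filter_candidates(candidates, t_hat):
    """Keep the sampled answers whose uncertainty is at most t_hat."""
    return [answer for answer, u in candidates if u <= t_hat]


# Toy example with made-up scores.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
t_hat = calibrate_threshold(scores, beta=0.25)
print(t_hat)  # 0.8
print(filter_candidates([("Paris", 0.1), ("Lyon", 0.95)], t_hat))  # ['Paris']
```

Raising `beta` lowers the threshold and shrinks the prediction set, at the cost of a higher chance of discarding every admissible answer.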