Conditional Factuality Controlled LLMs with Generalization Certificates via Conformal Sampling

Kai Ye, Qingtao Pan, Shuo Li

Abstract

Large language models (LLMs) need reliable test-time control of hallucinations. Existing conformal methods for LLMs typically provide only \emph{marginal} guarantees and rely on a single global threshold, which can under-cover hard prompts, over-cover easy ones, and produce oversized prediction sets. We propose \emph{Conditional Factuality Control} (CFC), a post-hoc conformal framework that returns \emph{set-valued} outputs with \emph{conditional} coverage guarantees. CFC defines a continuous, feature-conditional acceptance threshold through augmented quantile regression on a latent ``success'' score, and deploys it through a fixed-point threshold rule at inference time. Theoretically, we show that CFC satisfies a conditional coverage guarantee under exchangeability and analyze its \emph{efficiency}, proving that, under mild assumptions on the score distributions, the conditional rule is strictly more sample-efficient than marginal conformal prediction at the same target coverage. We further derive a PAC-style variant, CFC-PAC, which shrinks the nominal risk level based on a stability bound, yielding a finite-sample certificate that the conditional miscoverage deviates from the target by at most $O(\sqrt{\log(1/\delta)/N})$. Empirically, on synthetic data, real-world reasoning and QA benchmarks, and a Flickr8k VLM setting, CFC and CFC-PAC consistently attain near-target coverage across difficulty groups while using smaller prediction sets than CP and non-CP baselines.
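The two ingredients the abstract describes — a feature-conditional acceptance threshold fit by quantile regression on a latent success score, and a PAC-style shrinkage of the nominal level — can be sketched in a few lines. This is a minimal illustration under assumed data and constants (the score model, feature map $\Phi(X)=(1,x)$, learning rate, and stability constant $c$ are all hypothetical, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: x is a scalar "difficulty" feature per prompt, and
# s is a latent success score; harder prompts get lower, noisier scores.
N = 4000
x = rng.uniform(0.0, 1.0, N)
s = rng.normal(loc=-2.0 * x, scale=0.5 + x)

alpha = 0.10
Phi = np.column_stack([np.ones(N), x])   # linear feature map Phi(X)

# Fit Phi(X)^T beta to the alpha-quantile of s via the pinball (quantile)
# loss, minimized here with plain full-batch subgradient descent (an
# illustrative optimizer choice, not the paper's).
beta = np.zeros(2)
lr = 0.05
for _ in range(3000):
    r = s - Phi @ beta
    grad = -Phi.T @ np.where(r > 0, alpha, alpha - 1.0) / N
    beta -= lr * grad

# Feature-conditional acceptance threshold lambda_hat(x): accept when
# the score clears it, so roughly a (1 - alpha) fraction is accepted.
lam = Phi @ beta
print("empirical acceptance rate P(s >= lambda):", np.mean(s >= lam))

# CFC-PAC-style shrinkage (sketch): calibrate at a smaller nominal level
# alpha' = alpha - c * sqrt(log(1/delta) / N); c is a placeholder for the
# stability constant in the paper's bound.
delta, c = 0.05, 1.0
alpha_pac = alpha - c * np.sqrt(np.log(1.0 / delta) / N)
print("shrunk nominal level alpha':", round(alpha_pac, 4))
```

Because the threshold is linear in the difficulty feature, easy prompts end up with stricter (higher) thresholds and hard prompts with looser ones, matching the mechanism the figures below describe.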

Paper Structure

This paper contains 58 sections, 5 theorems, 108 equations, 6 figures, 13 tables, 2 algorithms.

Key Result

Theorem 4.1

Let $\mathcal{F}=\{\Phi(X)^\top\beta: \beta\in\mathbb{R}^d\}$ be any finite-dimensional linear class, and assume exchangeability. Then for any non-negative $f\in\mathcal{F}$ with $\mathbb{E}[f(X)]>0$, the prediction set $\widehat{C}_\alpha$ in Eq.~\eqref{eq:pre_set} satisfies the $f$-weighted conditional coverage guarantee
$$\mathbb{E}\big[f(X_{N+1})\,\mathbb{1}\{Y_{N+1}\in\widehat{C}_\alpha(X_{N+1})\}\big]\;\ge\;(1-\alpha)\,\mathbb{E}\big[f(X_{N+1})\big].$$
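A quick way to sanity-check a guarantee of this form is the special case where $\mathcal{F}$ is spanned by indicators of disjoint groups: the $f$-weighted bound then reduces to per-group coverage at level $1-\alpha$, achieved by running split conformal separately within each group. A minimal numeric sketch (the score model and group structure are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# F spanned by indicators of K disjoint groups: the guarantee
# E[f(X) 1{covered}] >= (1 - alpha) E[f(X)] becomes per-group
# coverage >= 1 - alpha, so calibrate a threshold per group.
K, n_cal, n_test, alpha = 4, 2000, 2000, 0.10

def scores(group, n):
    # Nonconformity scores whose spread grows with group "difficulty".
    return rng.normal(0.0, 1.0 + group, n)

coverages = []
for g in range(K):
    cal = scores(g, n_cal)
    test = scores(g, n_test)
    # Conformal quantile with the finite-sample (n + 1) correction.
    level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
    q = np.quantile(cal, level, method="higher")
    coverages.append(np.mean(test <= q))

print([round(c, 3) for c in coverages])  # each near 1 - alpha = 0.90
```

Each group's empirical coverage concentrates near $1-\alpha$ even though the score scales differ by group — exactly what a single marginal threshold fails to deliver.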

Figures (6)

  • Figure 1: Limitation of marginal CP and advantage of proposed CFC. Left: A single global threshold learned from the marginal score mixture yields only marginal coverage and can under‑cover hard prompts while over‑covering easy ones. Right: Our CFC learns a data‑dependent threshold via conformal quantile regression, adapting the acceptance level to input features and achieving conditional coverage across subgroups.
  • Figure 2: Groupwise miscoverage on synthetic data across 10 difficulty bins. The dashed line marks the target miscoverage $\alpha = 0.10$. Learnt CP improves over marginal baselines, but CFC and CFC-P remain closest to the target across all bins, especially on hard prompts.
  • Figure 3: Learned threshold $\widehat{\lambda}_\alpha(X)$ versus prompt difficulty. Easy prompts receive stricter thresholds, while harder prompts receive looser thresholds, which is the mechanism behind the improved group-wise reliability of CFC.
  • Figure 4: Groupwise miscoverage on real-world datasets at representative target errors: TriviaQA $\alpha=0.25$, GSM8K $\alpha=0.10$, and Flickr8k $\alpha=0.03$. Bars show mean miscoverage over split seeds and the upper error bars show one standard deviation. The TriviaQA panel uses the chosen two-group feature map; Appendix \ref{app:exp-details} gives its exact construction. The GSM8K and Flickr8k panels use five equal-frequency difficulty groups ordered from easy to hard. The dashed line marks the target miscoverage $\alpha$. Across all three datasets, the conditional methods flatten the miscoverage profile relative to marginal baselines, especially on the hardest groups.
  • Figure 5: Learned threshold $\widehat{\lambda}_\alpha(X)$ versus prompt difficulty in the updated synthetic run ($\alpha=0.10$, 5 bins). Easy prompts receive stricter thresholds, while harder prompts receive looser thresholds, explaining the improved group-wise coverage of CFC relative to global-threshold baselines.
  • ...and 1 more figure
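The groupwise plots above bin examples into equal-frequency difficulty groups and report miscoverage per bin. A minimal sketch of that evaluation step, with a placeholder per-example coverage indicator standing in for a real prediction-set method:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical evaluation sketch: given a difficulty score and a binary
# coverage indicator per example, form equal-frequency bins and report
# miscoverage per bin, as in the groupwise figures.
n, n_bins, alpha = 5000, 5, 0.10
difficulty = rng.uniform(0.0, 1.0, n)
covered = rng.random(n) < (1 - alpha)   # placeholder indicator

# Equal-frequency bin edges via quantiles of the difficulty score.
edges = np.quantile(difficulty, np.linspace(0.0, 1.0, n_bins + 1))
bins = np.clip(np.searchsorted(edges, difficulty, side="right") - 1,
               0, n_bins - 1)

miscov = np.array([1.0 - covered[bins == b].mean() for b in range(n_bins)])
print(np.round(miscov, 3))  # each bin's miscoverage, easy to hard
```

For a method with conditional coverage the resulting profile is flat near $\alpha$; a marginal method instead shows the easy-bin/hard-bin imbalance that Figures 2 and 4 illustrate.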

Theorems & Definitions (5)

  • Theorem 4.1: Conditional coverage of CFC
  • Theorem 4.2: PAC conditional coverage for CFC
  • Proposition 4.3: Oracle CFC efficiency
  • Theorem 4.4: CFC inherits oracle efficiency
  • Theorem A.1: Conditional coverage for linear classes (\citet{gibbs2024conformal}, Theorem 2)