LEC: Linear Expectation Constraints for False-Discovery Control in Selective Prediction and Routing Systems

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

TL;DR

This work tackles the unreliability of LLM outputs by introducing Linear Expectation Constraints (LEC) to enforce false discovery rate (FDR) control in selective prediction. LEC reframes the problem as constrained decision-making and derives finite-sample conditions, computable from calibration data, to guarantee test-time FDR below a user-specified level while maximizing coverage. It extends naturally to two-model routing, maintaining a unified FDR guarantee while routing uncertain cases to a stronger model to boost efficiency. Empirically, LEC achieves tighter calibration and higher acceptance than prior confidence-bound methods across closed- and open-ended QA, and routing further improves correct acceptance beyond any single model.

Abstract

Large language models (LLMs) often generate unreliable answers, while heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, causing users to accept erroneous answers without statistical guarantees. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. Then, we establish a finite-sample sufficient condition, which relies only on a held-out set of exchangeable calibration samples, to compute an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model's uncertainty exceeds its calibrated threshold, we delegate it to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than each individual model.
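The threshold selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the function name `calibrate_threshold`, the acceptance rule `u <= lam`, and the specific finite-sample check (the calibration sum of `(err - alpha)` over accepted samples plus a worst-case term `1 - alpha` for the unseen test point, in the spirit of conformal risk control) are all assumptions, since the paper's own condition is not reproduced in this summary.

```python
import numpy as np

def calibrate_threshold(u_cal, err_cal, alpha):
    """Return the largest uncertainty threshold lam (hypothetical helper) such that
    sum_i (err_i - alpha) * 1[u_i <= lam] + (1 - alpha) <= 0 on the calibration set.
    Returns None when no threshold passes the check (abstain on everything)."""
    u_cal = np.asarray(u_cal, dtype=float)
    err_cal = np.asarray(err_cal, dtype=float)
    if u_cal.size == 0:
        return None

    # Sort calibration samples from most to least confident.
    order = np.argsort(u_cal, kind="stable")
    u_sorted = u_cal[order]

    # Running value of sum (err_i - alpha) over the k most confident samples.
    running = np.cumsum(err_cal[order] - alpha)

    # Assumed check: add the worst case (err = 1) for one unseen test point.
    feasible = running + (1.0 - alpha) <= 0.0

    # A threshold equal to a tied score accepts the whole tie group,
    # so only the last index of each group is a valid candidate.
    last_of_group = np.append(u_sorted[1:] != u_sorted[:-1], True)

    candidates = np.flatnonzero(feasible & last_of_group)
    if candidates.size == 0:
        return None
    # Largest feasible threshold, i.e. maximal coverage under the check.
    return float(u_sorted[candidates[-1]])


u = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
err = [0, 0, 0, 0, 0, 1, 1, 1]
lam = calibrate_threshold(u, err, alpha=0.25)
```

In this toy example the five most confident answers are all correct, so the check passes up to the score 0.5 and fails once the erroneous samples are admitted. Taking the largest passing threshold is what ties the risk constraint to coverage maximization.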

Paper Structure

This paper contains 17 sections, 2 theorems, 50 equations, 11 figures, 5 tables.

Key Result

Theorem 3.1

Assume that calibration and test examples are exchangeable [angelopoulos2023conformal]. Let $\hat{\lambda}^{(a)}$ be the threshold defined by Eq. (eq:max-crc-single) using $\mathcal{D}_{\mathrm{cal}}$. Then, for a new test sample $(x_{n+1},y_{n+1}^*)$ with uncertainty score and error indicator $(u_{n+1}^{(a)}, err_{n+1}^{(a)})$, the FDR among accepted predictions is controlled at the target risk level $\alpha$, where the probability is taken over the joint randomness of the calibration set and the test sample (marginal guarantee).
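For intuition on the name "Linear Expectation Constraint", the FDR requirement can be rewritten as a condition that is linear in expectations. The notation here is assumed for illustration and may differ from the paper's: $S = \mathbb{1}\{u \le \lambda\}$ is the selection indicator for a threshold $\lambda$, $err \in \{0, 1\}$ is the error indicator, FDR is taken as the error probability conditional on acceptance, and $\Pr(S = 1) > 0$.

$$
\mathrm{FDR}(\lambda) = \Pr(err = 1 \mid S = 1) = \frac{\mathbb{E}[err \cdot S]}{\mathbb{E}[S]} \le \alpha
\;\Longleftrightarrow\;
\mathbb{E}[err \cdot S] - \alpha\, \mathbb{E}[S] \le 0
\;\Longleftrightarrow\;
\mathbb{E}\big[(err - \alpha)\, S\big] \le 0 .
$$

The last form involves a single expectation of a bounded quantity, which is what makes a finite-sample check on calibration data possible.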

Figures (11)

  • Figure 1: Illustration of selective prediction in single-model and two-model routing systems.
  • Figure 2: FDR control at various $\alpha$ on both the CommonsenseQA and TriviaQA datasets with seven LLMs (mean±std).
  • Figure 3: Upper confidence bound vs. test-time FDR at various uncertainty thresholds. In (a), we use LLaMA-3.1-8B with white-box PE as the uncertainty measure; in (b), we use Qwen2.5-14B with black-box SE as the uncertainty measure.
  • Figure 4: Comparison of test-time FDR on the CommonsenseQA dataset across seven LLMs (mean).
  • Figure 5: FDR control of two LLMs routing at various risk levels on the CommonsenseQA dataset (mean±std).
  • ...and 6 more figures

Theorems & Definitions (2)

  • Theorem 3.1: Single-model FDR control
  • Theorem 3.2: FDR control for the two-model routing system
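
The two-model routing rule behind Theorem 3.2 can be sketched as below. This is a hedged illustration only: the names `route` and `system_fdr`, the record layout, and the choice to abstain when the stronger model also exceeds its threshold are assumptions, and the thresholds `lam_a` and `lam_b` are taken as already calibrated.

```python
from typing import Iterable, Optional, Tuple

def route(u_a: float, u_b: float, lam_a: float, lam_b: float) -> str:
    """Hypothetical routing rule: accept model A if it is confident enough,
    otherwise delegate to the stronger model B, otherwise abstain."""
    if u_a <= lam_a:
        return "accept_a"
    if u_b <= lam_b:  # in practice B is only queried when A is rejected
        return "accept_b"
    return "abstain"  # assumption: neither model is confident enough

def system_fdr(
    records: Iterable[Tuple[float, int, float, int]],
    lam_a: float,
    lam_b: float,
) -> Optional[float]:
    """Empirical FDR of the routed system on labeled data.
    Each record is (u_a, err_a, u_b, err_b); returns None if nothing is accepted."""
    accepted = 0
    errors = 0
    for u_a, err_a, u_b, err_b in records:
        decision = route(u_a, u_b, lam_a, lam_b)
        if decision == "accept_a":
            accepted += 1
            errors += err_a
        elif decision == "accept_b":
            accepted += 1
            errors += err_b
    return errors / accepted if accepted else None


records = [
    (0.1, 0, 0.9, 1),  # A confident and correct
    (0.8, 1, 0.2, 0),  # A uncertain, routed to B, B correct
    (0.9, 1, 0.9, 1),  # both uncertain, abstain
    (0.3, 1, 0.1, 0),  # A confident but wrong
]
fdr = system_fdr(records, lam_a=0.5, lam_b=0.5)
```

The point of measuring the error rate over the union of both models' accepted answers is that the guarantee described in the abstract is a unified one for the whole system rather than a separate bound per model.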