
Large language model validity via enhanced conformal prediction methods

John J. Cherian, Isaac Gibbs, Emmanuel J. Candès

TL;DR

We develop new conformal inference methods that provide validity guarantees on the output of large language models (LLMs), and we show how to systematically improve the quality of the scoring function via a novel algorithm for differentiating through the conditional conformal procedure.

Abstract

We develop new conformal inference methods for obtaining validity guarantees on the output of large language models (LLMs). Prior work in conformal language modeling identifies a subset of the text that satisfies a high-probability guarantee of correctness. These methods work by filtering claims from the LLM's original response if a scoring function evaluated on the claim fails to exceed a threshold calibrated via split conformal prediction. Existing methods in this area suffer from two deficiencies. First, the guarantee stated is not conditionally valid. The trustworthiness of the filtering step may vary based on the topic of the response. Second, because the scoring function is imperfect, the filtering step can remove many valuable and accurate claims. We address both of these challenges via two new conformal methods. First, we generalize the conditional conformal procedure of Gibbs et al. (2023) in order to adaptively issue weaker guarantees when they are required to preserve the utility of the output. Second, we show how to systematically improve the quality of the scoring function via a novel algorithm for differentiating through the conditional conformal procedure. We demonstrate the efficacy of our approach on biography and medical question-answering datasets.
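The filtering step used by prior work, as described in the abstract, can be illustrated with a minimal sketch of split conformal calibration. This is not the authors' implementation. It assumes that each calibration response comes with per-claim scores and correctness labels, that a claim is kept only when its score strictly exceeds the threshold, and that the conformity score of a response is the largest score given to any of its false claims. The function names (`max_false_score`, `calibrate_threshold`, `filter_claims`) are hypothetical.

```python
import math


def max_false_score(scores, labels):
    """Largest score assigned to an incorrect claim (-inf if every claim is correct)."""
    false_scores = [s for s, ok in zip(scores, labels) if not ok]
    return max(false_scores, default=-math.inf)


def calibrate_threshold(calibration, alpha):
    """Split conformal threshold from a list of (scores, labels) pairs.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest conformity score,
    or +inf when the calibration set is too small for the requested level.
    """
    conformity = sorted(max_false_score(s, y) for s, y in calibration)
    n = len(conformity)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # no finite threshold achieves the level: filter everything
    return conformity[k - 1]


def filter_claims(claims, scores, threshold):
    """Keep only the claims whose score strictly exceeds the threshold."""
    return [c for c, s in zip(claims, scores) if s > threshold]


# Toy calibration set (made-up numbers): one (scores, labels) pair per response.
calibration = [
    ([0.9, 0.2], [True, False]),
    ([0.8, 0.5], [True, False]),
    ([0.7, 0.6], [True, True]),
    ([0.9, 0.4], [True, False]),
    ([0.3, 0.95], [False, True]),
    ([0.6, 0.1], [False, True]),
    ([0.85, 0.7], [True, False]),
]
tau = calibrate_threshold(calibration, alpha=0.25)
kept = filter_claims(["claim A", "claim B", "claim C"], [0.9, 0.6, 0.3], tau)
print(tau, kept)
```

Under exchangeability of the calibration and test responses, a threshold chosen this way keeps no false claim with probability at least $1 - \alpha$, but only marginally. This is the first deficiency the abstract identifies.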

Paper Structure

This paper contains 32 sections, 8 theorems, 43 equations, 14 figures, and 1 algorithm.

Key Result

Theorem 2.1

Let $\mathcal{F} = \{\Phi(X)^\top\beta : \beta \in \mathbb{R}^d\}$ denote any finite-dimensional linear class. Assume that $\{(X_i,S_i)\}_{i=1}^{n+1}$ are exchangeable and that solutions to \eqref{eq:original_qr} and its dual are computed symmetrically on the input data. Then, for all $f \in \mathcal{F}$,

Figures (14)

  • Figure 1: The left panel displays the output of GPT-3.5-Turbo for the prompt "How often is a shingles vaccine required?" The first filtered output (center) is calibrated using the frequency score (see \ref{sec:app_claim_score}) and the marginally valid conformal factuality method of Mohri and Hashimoto (2024) at a fixed level of 90%. The second filtered output (right) is calibrated using a score obtained via our conditional boosting procedure (\ref{sec:boosting}) at a level of 63%, which is chosen and calibrated using our adaptive method (\ref{sec:level-adaptive}) to approximately ensure that at least 70% of the claims are retained. Both filtered outputs are guaranteed to include no false claims with the stated probability.
  • Figure 2: Empirical demonstration of our methods. The panels display results for our conditional boosting and level-adaptive methods. We aim to issue outputs with $0$ factual errors, and for the latter method, we choose the level with the objective of retaining at least 70% of the original claims in the prompt. The left panel compares the binned nominal probabilities of factuality reported by our method against the realized probability of factuality for data points belonging to each bin. These probabilities are estimated using $500$ test points over 100 calibration-test splits. The plotted bins, which are also given as inputs to our method, are $[0.5, 0.55], [0.55, 0.6],\dots,[0.8, 0.85]$. Finally, the right-hand panel displays the claim retention obtained with unboosted scores (blue), boosted scores (orange), and boosted scores + level-adaptive CP (green). The first two methods are implemented at a fixed error rate of $\alpha = 0.1$. Boxplots in this panel show the distribution of retained claims for 100 calibration-test splits with each containing 2354 calibration points and 500 test points.
  • Figure 3: Comparison of the marginal boosting procedure of Stutz et al. (2021) (blue) against our conditional boosting method (orange). The left panel shows the prediction sets produced by each method, while the right panel displays the conditional coverage, $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X^{(1)}_{n+1})$, against the values of the first feature, $X_{n+1}^{(1)}$. Using the Adam optimizer with learning rate set to $0.001$, the plotted scores are boosted for $500$ steps on a synthetic dataset of size $n = 1000$ and evaluated on another test dataset of size $n = 2000$.
  • Figure 4: Performance of our level-adaptive method on a synthetic dataset. We use $n = 2000$ points to estimate the adaptive $\alpha(\cdot)$ function. We then run the level-adaptive method on a calibration set of size $n = 1000$ and evaluate the method on $1$ test point; the plotted points (center, right) are obtained from $200$ trials. The left panel shows the distribution of interval lengths obtained on the test set for fixed values of $\alpha \in \{0.05,0.25,0.5,0.75,0.95\}$. The center panel displays the interval lengths obtained when the level $\alpha(X_{n+1})$ is instead chosen adaptively to ensure a prediction set size of at most 500 (red line). Finally, the right panel compares the realized coverage $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1}) \mid \alpha(X_{n+1}))$ against the nominal level, $1-\alpha(X_{n+1})$, reported by our method with $\mathcal{F} = \{\beta_0 + \sum_{i = 1}^2 \beta_i x^i + \beta_3 \alpha(x) \mid \beta \in \mathbb{R}^4 \}$.
  • Figure 5: Empirical demonstration of our method on the Wikipedia biographies dataset. The left two panels display results for our level-adaptive method, which aims to issue biographies with 3 or fewer errors, while retaining at least 80% of the original claims in the prompt. The left panel compares the binned nominal probabilities reported by our method with bin width $2.5\%$ against the true realized empirical values evaluated over $381$ test points and 100 calibration-test splits. The center panel compares the number of claims retained by our method (orange) against the fixed level method of Mohri and Hashimoto (2024) (blue). In this plot, the $y$-axis displays a moving average of the number of claims retained with window size $1000$, while the $x$-axis displays a moving average of the number of views (in Jan. 2023) of the Wikipedia article associated with each prompt on the log-scale. Finally, the right-hand panel displays the claim retention obtained with boosted (orange) and unboosted (blue) scores at a fixed level of $\alpha = 0.1$. Boxplots in this panel show the distribution of retentions for 100 calibration-test splits, each containing 7246 calibration points and 381 test points.
  • ...and 9 more figures
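
The calibration check described in the captions of Figures 2 and 5, which compares binned nominal probabilities against realized frequencies, can be sketched as follows. This is an illustrative sketch only: the helper name `realized_by_bin`, the half-open bin convention, and the toy data are assumptions and are not taken from the paper.

```python
from bisect import bisect_right


def realized_by_bin(nominal, outcomes, edges):
    """Realized frequency of success within each bin of nominal probability.

    Bins are half-open intervals [edges[j], edges[j + 1]); points that fall
    outside every bin are ignored, and empty bins are reported as None.
    """
    totals = [0] * (len(edges) - 1)
    hits = [0] * (len(edges) - 1)
    for p, ok in zip(nominal, outcomes):
        j = bisect_right(edges, p) - 1  # index of the bin containing p
        if 0 <= j < len(totals):
            totals[j] += 1
            hits[j] += int(ok)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Bin edges matching the width-0.05 bins listed in the Figure 2 caption.
edges = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
# Toy data (made up): reported probability of factuality and whether the
# filtered output was in fact free of errors.
nominal = [0.52, 0.53, 0.61, 0.63, 0.64, 0.82]
outcomes = [True, False, True, True, False, True]
print(realized_by_bin(nominal, outcomes, edges))
```

For a well-calibrated method, the realized frequency in each bin should be close to the nominal probabilities of the points that fall in that bin.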

Theorems & Definitions (8)

  • Theorem 2.1: Proposition 4 of Gibbs et al. (2023)
  • Theorem 3.1
  • Theorem 3.2
  • Proposition 3.1
  • Theorem A.1
  • Corollary A.1: Extension of Theorem \ref{thm:fixed_alpha_adapt_cov_gen_loss}
  • Corollary A.2: Extension of Theorem \ref{thm:adapt_alpha_gen_loss}
  • Proposition C.1