API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access

Jiayuan Su, Jing Luo, Hongwei Wang, Lu Cheng

TL;DR

This work tackles uncertainty quantification for API-only LLMs that do not expose logits. It introduces LofreeCP, a logit-free conformal predictor that combines a frequency-based ranking proxy with two fine-grained uncertainty notions—normalized entropy (NE) and semantic similarity (SS)—to form nonconformity scores and calibrated prediction sets under CP. The authors show that frequency-only probability estimation is computationally infeasible and establish a formal coverage guarantee for LofreeCP. Empirically, LofreeCP achieves smaller average prediction set sizes (APSS) and competitive or superior coverage compared with logit-based CP baselines on TriviaQA, WebQuestions, and MMLU. This approach enables practical, calibrated uncertainty estimation for API-based LLMs and broadens CP applicability beyond access to internal model logits.

Abstract

This study aims to address the pervasive challenge of quantifying uncertainty in large language models (LLMs) without logit-access. Conformal Prediction (CP), known for its model-agnostic and distribution-free features, is a desired approach for various LLMs and data distributions. However, existing CP methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. In addition, logits are known to be miscalibrated, potentially leading to degraded CP performance. To tackle these challenges, we introduce a novel CP method that (1) is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage. The core idea of this approach is to formulate nonconformity measures using both coarse-grained (i.e., sample frequency) and fine-grained uncertainty notions (e.g., semantic similarity). Experimental results on both close-ended and open-ended Question Answering tasks show our approach can mostly outperform the logit-based CP baselines.
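The two evaluation quantities named above, coverage and prediction set size, can be computed as in the following minimal sketch. The function names and the toy data are hypothetical, and reading the coverage metric as the fraction of test examples whose true answer falls inside its prediction set is an assumption of this sketch, not a definition taken from the paper.

```python
from typing import Hashable, Sequence, Set


def empirical_coverage_rate(pred_sets: Sequence[Set[Hashable]],
                            truths: Sequence[Hashable]) -> float:
    """Fraction of examples whose true answer lies in its prediction set."""
    if len(pred_sets) != len(truths):
        raise ValueError("pred_sets and truths must have the same length")
    if not truths:
        raise ValueError("need at least one example")
    hits = sum(1 for s, y in zip(pred_sets, truths) if y in s)
    return hits / len(truths)


def average_prediction_set_size(pred_sets: Sequence[Set[Hashable]]) -> float:
    """Mean number of candidate answers per prediction set (APSS)."""
    if not pred_sets:
        raise ValueError("need at least one prediction set")
    return sum(len(s) for s in pred_sets) / len(pred_sets)


# Toy data (hypothetical): four questions with their prediction sets.
pred_sets = [{"Paris"}, {"Rome", "Milan"}, {"Oslo", "Bergen", "Tromso"}, set()]
truths = ["Paris", "Milan", "Stockholm", "Cairo"]

ecr = empirical_coverage_rate(pred_sets, truths)   # 2 of 4 sets contain the truth
apss = average_prediction_set_size(pred_sets)      # (1 + 2 + 3 + 0) / 4
print(ecr, apss)
```

A method is preferable when it keeps coverage at or above the target level $1-\alpha$ while producing a smaller APSS.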

Paper Structure

This paper contains 35 sections, 3 theorems, 17 equations, 9 figures, and 7 tables.

Key Result

Theorem 2.1

Suppose $(X_i, Y_i)_{i=1,\dots,n}$ and $(X_{\text{test}}, Y_{\text{test}})$ are independent and identically distributed (i.i.d.), and let $C_{1-\alpha}(X_{\text{test}})$ be a set-valued mapping satisfying the nesting property defined in the paper. Then the following holds:

$$\mathbb{P}\big(Y_{\text{test}} \in C_{1-\alpha}(X_{\text{test}})\big) \geq 1 - \alpha,$$

where $\alpha \in (0, 1)$ is the user-defined error rate.
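A guarantee of this kind is commonly realized through split conformal calibration. The sketch below shows that standard recipe, not the paper's exact procedure: the threshold is the $\lceil (n+1)(1-\alpha) \rceil$-th smallest nonconformity score on the calibration set, and the prediction set keeps every candidate whose score does not exceed it. The function names and example scores are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # "include everything" set can satisfy the guarantee.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(candidate_scores: Dict[str, float], threshold: float) -> List[str]:
    """Keep candidates whose nonconformity score is at most the threshold."""
    return [c for c, s in candidate_scores.items() if s <= threshold]


# Hypothetical calibration scores: one score per calibration example,
# computed for that example's true answer.
cal_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.5]
threshold = conformal_threshold(cal_scores, alpha=0.25)  # rank 6 of 7

# Hypothetical candidate answers for one test prompt.
candidates = {"A": 0.3, "B": 0.6, "C": 0.9}
print(threshold, prediction_set(candidates, threshold))
```

Lower scores mean a candidate conforms better, so raising $1-\alpha$ can only raise the threshold and enlarge the set, which is the nesting behavior the theorem relies on.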

Figures (9)

  • Figure 1: Illustration of the problem and the proposed solution, using three uncertainty notions to measure nonconformity. (1) Frequency only: the nonconformity score is $1 - \text{frequency}$, where frequency is the share of 10 sampled responses that match a given response. Scores concentrate at 0.6, 0.7, and 0.8: responses to different prompts (e.g., "Big Bill Broonzy" and "Joan Rivers") share a score of 0.6, and responses to the same prompt (e.g., "Bill Boonzy" and "Sir William Rockington") share a score of 0.7. (2) Frequency combined with NE: the score is $1 - \text{frequency} + \text{NE}$, and concentration remains at 0.75 and 0.86. (3) Frequency, NE, and SS combined: the score is $1 - \text{frequency} + \text{NE} - \text{SS}$, with no observed concentration.
  • Figure 2: Empirical findings with TriviaQA dataset.
  • Figure 3: Ablation study. The blue bar chart represents APSS, while the gray line represents ECR.
  • Figure 4: Results on MCQ task, with the error rate of 0.2. Our method and baselines are applied individually to each of the 16 subjects.
  • Figure 5: Results of the sensitivity analysis for different backbone models: Llama-2-7b and Llama-2-13b.
  • ...and 4 more figures
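The three scores described in the Figure 1 caption can be sketched as follows. Several details are assumptions of this sketch rather than the paper's definitions: NE is taken as the Shannon entropy of the empirical response distribution divided by the log of the sample count, SS is measured against the most frequent response, and a token-overlap (Jaccard) similarity stands in for whatever semantic similarity model the authors use. The sample counts are chosen to match the 0.6 and 0.7 scores quoted in the caption.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def frequency_only_scores(samples: List[str]) -> Dict[str, float]:
    """Score each distinct response as 1 - frequency."""
    m = len(samples)
    return {r: 1 - c / m for r, c in Counter(samples).items()}


def normalized_entropy(samples: List[str]) -> float:
    """Entropy of the empirical response distribution, scaled by log(m).

    Dividing by log(m) is an assumption made for this sketch.
    """
    m = len(samples)
    if m < 2:
        return 0.0
    probs = [c / m for c in Counter(samples).values()]
    return -sum(p * math.log(p) for p in probs) / math.log(m)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def combined_scores(samples: List[str],
                    similarity: Callable[[str, str], float]) -> Dict[str, float]:
    """Score each distinct response as 1 - frequency + NE - SS."""
    counts = Counter(samples)
    top_response = counts.most_common(1)[0][0]  # assumed SS reference point
    ne = normalized_entropy(samples)
    return {
        r: base + ne - similarity(r, top_response)
        for r, base in frequency_only_scores(samples).items()
    }


# Ten hypothetical samples for one prompt (frequencies 0.4, 0.3, 0.3).
samples = (["Big Bill Broonzy"] * 4
           + ["Bill Boonzy"] * 3
           + ["Sir William Rockington"] * 3)

freq = frequency_only_scores(samples)
full = combined_scores(samples, jaccard)
print(freq)
print(full)
```

With frequency alone, the two less frequent responses tie. NE is constant within a prompt, so it can only separate responses across prompts; the SS term is what breaks the within-prompt tie here, because "Bill Boonzy" shares a token with the top response while "Sir William Rockington" does not.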

Theorems & Definitions (5)

  • Theorem 2.1: Conformal coverage guarantee
  • Lemma 3.1: Minimum Sample Size for Confident Probability Estimation
  • Proposition 3.2: Coverage guarantee of LofreeCP
  • Proof
  • Proof