API Is Enough: Conformal Prediction for Large Language Models Without Logit-Access

Jiayuan Su; Jing Luo; Hongwei Wang; Lu Cheng

TL;DR

This work tackles uncertainty quantification for API-only LLMs that do not expose logits. It introduces LofreeCP, a logit-free conformal prediction (CP) method that combines a frequency-based ranking proxy with two fine-grained uncertainty notions, normalized entropy (NE) and semantic similarity (SS), to form nonconformity scores and calibrated prediction sets. The authors show that frequency-only probability estimation is computationally infeasible (Lemma 3.1) and establish a formal coverage guarantee for LofreeCP (Proposition 3.2). Empirically, LofreeCP achieves smaller average prediction set sizes (APSS) and competitive or superior coverage compared with logit-based CP baselines on TriviaQA, WebQuestions, and MMLU. The approach enables practical, calibrated uncertainty estimation for API-based LLMs and extends CP to settings where internal model logits are unavailable.

Abstract

This study aims to address the pervasive challenge of quantifying uncertainty in large language models (LLMs) without logit-access. Conformal Prediction (CP), known for its model-agnostic and distribution-free features, is a desired approach for various LLMs and data distributions. However, existing CP methods for LLMs typically assume access to the logits, which are unavailable for some API-only LLMs. In addition, logits are known to be miscalibrated, potentially leading to degraded CP performance. To tackle these challenges, we introduce a novel CP method that (1) is tailored for API-only LLMs without logit-access; (2) minimizes the size of prediction sets; and (3) ensures a statistical guarantee of the user-defined coverage. The core idea of this approach is to formulate nonconformity measures using both coarse-grained (i.e., sample frequency) and fine-grained uncertainty notions (e.g., semantic similarity). Experimental results on both close-ended and open-ended Question Answering tasks show our approach can mostly outperform the logit-based CP baselines.

Paper Structure (35 sections, 3 theorems, 17 equations, 9 figures, 7 tables)

Introduction
Preliminaries of Conformal Prediction
Methodology
Frequency As the Rankings Proxy
Fine-grained Uncertainty Notions
CP for LLMs Without Logit-Access
Experiments
Experimental Setup
Results for QA
Ablation Study
Results for MCQ
Sensitivity Analyses
Related Work
Conclusion
Theoretical Proofs
...and 20 more sections

Key Result

Theorem 2.1

Suppose $(X_i, Y_i)_{i=1,\dots,n}$ and $(X_{\text{test}}, Y_{\text{test}})$ are independent and identically distributed (i.i.d.), and $C_{1-\alpha}(X_{\text{test}})$ is a set-valued mapping satisfying the nesting property. Then the following holds:

$$\mathbb{P}\big(Y_{\text{test}} \in C_{1-\alpha}(X_{\text{test}})\big) \ge 1 - \alpha,$$

where $\alpha \in (0, 1)$ is the user-defined error rate.
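A coverage guarantee of this kind is commonly realized through split conformal calibration: nonconformity scores are computed on a held-out calibration set, a finite-sample-corrected quantile of those scores becomes the threshold, and every candidate answer scoring at or below the threshold enters the prediction set. The sketch below illustrates that generic recipe and is not necessarily the paper's exact procedure. The function names `conformal_quantile` and `prediction_set`, the calibration scores, and the candidate scores are all hypothetical.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    This is the usual finite-sample-corrected quantile of split conformal
    prediction (generic recipe; assumed here, not taken from the paper).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # "include everything" threshold keeps the guarantee.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is <= threshold."""
    return {label for label, score in candidate_scores.items() if score <= threshold}


# Hypothetical nonconformity scores of the true answers on 9 calibration prompts.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q_hat = conformal_quantile(cal_scores, alpha=0.2)

# Hypothetical scores of three candidate answers for one test prompt.
pred = prediction_set({"A": 0.15, "B": 0.75, "C": 0.95}, q_hat)
print(q_hat, sorted(pred))
```

With 9 calibration scores and an error rate of 0.2, the threshold is the 8th smallest score, so candidates "A" and "B" are kept and "C" is excluded. A smaller error rate pushes the threshold up and enlarges the set, which is the trade-off between coverage and set size that APSS measures.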

Figures (9)

Figure 1: Illustrations of the proposed problem and solution. Three uncertainty notions for measuring nonconformity: (1) Frequency-only, where the nonconformity score is calculated as $1 - \text{frequency}$, with frequency taken as the share of a response among 10 samplings. Concentration issues arise at scores of 0.6, 0.7, and 0.8. For instance, responses from different prompts (e.g., "Big Bill Broonzy" and "Joan Rivers") have the same score of 0.6, as well as responses within the same prompt (e.g., "Bill Boonzy" and "Sir William Rockington") which both have a score of 0.7, and so forth. (2) Frequency combined with NE, where the nonconformity score is calculated as $1 - \text{frequency} + \text{NE}$, revealing concentration issues at scores of 0.75 and 0.86. (3) Frequency, NE, and SS combined, where the nonconformity score is calculated as $1 - \text{frequency} + \text{NE} - \text{SS}$, with no observed concentration issues.
Figure 2: Empirical findings with TriviaQA dataset.
Figure 3: Ablation study. The blue bar chart represents APSS, while the gray line represents ECR.
Figure 4: Results on MCQ task, with the error rate of 0.2. Our method and baselines are applied individually to each of the 16 subjects.
Figure 5: Results of the sensitivity analysis for different backbone models: Llama-2-7b and Llama-2-13b.
...and 4 more figures
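
The combined score described in the Figure 1 caption (one minus frequency, plus NE, minus SS) can be sketched from sampled responses alone. The following is a minimal illustration under several assumptions that the summary does not confirm: NE is taken as the entropy of the empirical response distribution divided by the log of the number of samples, SS is measured against the most frequent response, the weights `w_ne` and `w_ss` are hypothetical knobs defaulting to 1, and a token-overlap Jaccard similarity stands in for a real semantic similarity model. The sampled responses are invented for illustration.

```python
import math
from collections import Counter


def normalized_entropy(samples):
    """Entropy of the empirical response distribution, scaled to [0, 1].

    Assumption: normalization by log(number of samples); the source does not
    spell out the exact definition.
    """
    m = len(samples)
    if m <= 1:
        return 0.0
    counts = Counter(samples)
    entropy = -sum((c / m) * math.log(c / m) for c in counts.values())
    return entropy / math.log(m)


def jaccard(a, b):
    """Toy stand-in for semantic similarity: token-overlap Jaccard in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def nonconformity_scores(samples, similarity=jaccard, w_ne=1.0, w_ss=1.0):
    """Score each distinct response as 1 - frequency + w_ne * NE - w_ss * SS."""
    m = len(samples)
    counts = Counter(samples)
    top_response = counts.most_common(1)[0][0]  # most frequent response
    ne = normalized_entropy(samples)
    return {
        response: 1 - count / m + w_ne * ne - w_ss * similarity(response, top_response)
        for response, count in counts.items()
    }


# Hypothetical: 10 responses sampled from an API for a single prompt.
samples = ["Big Bill Broonzy"] * 4 + ["Bill Broonzy"] * 3 + ["Muddy Waters"] * 3
scores = nonconformity_scores(samples)
for response, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{score:.3f}  {response}")
```

In this toy run, "Bill Broonzy" and "Muddy Waters" are tied under the frequency-only score (both appear 3 times out of 10), and the similarity term separates them because "Bill Broonzy" shares tokens with the most frequent response. This mirrors the concentration issue that the caption says the fine-grained notions are meant to resolve.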

Theorems & Definitions (5)

Theorem 2.1: Conformal coverage guarantee
Lemma 3.1: Minimum Sample Size for Confident Probability Estimation
Proposition 3.2: Coverage guarantee of LofreeCP
proof : Proof
proof : Proof
