E-Scores for (In)Correctness Assessment of Generative Model Outputs

Guneet S. Dhillon, Javier González, Teodora Pandeva, Alicia Curth

TL;DR

This work addresses the challenge of principled, post-hoc assessment of the (in)correctness of generative model outputs, particularly for LLMs, by moving from p-value based conformal methods to e-scores derived from e-values. The authors define a scalable framework that allows adaptive, data-driven post-hoc tolerance levels while guaranteeing a size distortion bound, formalized as $\mathbb{E}\left[ \frac{\mathbf{1}\{\exists(\mathbf{Y},\cdot)\in \mathds{S}_{\alpha}(X,g_{\pi}(X)) : o(X,\mathbf{Y})=0\}}{\alpha(X,\mathds{S}(X,g_{\pi}(X)))} \right] \le 1$ under exchangeability, and implement e-scores as the reciprocal of appropriately defined e-values. The method leverages a calibrated oracle estimator to proxy correctness, combines multiple e-scores for robustness, and demonstrates efficacy on two key tasks: mathematical factuality and property-constrained output (e.g., instruction-following, truthfulness). Empirical results show e-scores upper bound size distortion and maintain competitive precision-recall relative to p-scores, with advantages in memory/time efficiency. The theoretical and experimental findings indicate that e-scores enable reliable, post-hoc control over the (in)correctness of large-scale generated outputs across diverse use-cases and calibration data settings, paving the way for practical, robust evaluation of generative systems.

Abstract

While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores provide users flexibility in adaptively choosing tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types: mathematical factuality and property constraints satisfaction.
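The contrast between p-values and e-values under post-hoc tolerance levels can be illustrated with a small simulation. This is a toy sketch and not the paper's experimental setup: it assumes every trial is a true null (so every rejection is an error), a one-sided Gaussian p-value, a likelihood-ratio style e-value with expectation 1, and a hypothetical "pick the smallest level on a grid that still rejects" strategy. The names `simulate`, `smallest_rejecting_alpha`, and `lam` are illustrative only. In the source's terms, rejecting when the e-value is at least $1/\alpha$ corresponds to the e-score (its reciprocal) being at most $\alpha$.

```python
import math
import random


def smallest_rejecting_alpha(reject, grid):
    """Return the smallest alpha in the ascending grid at which reject(alpha) holds, else None."""
    for alpha in grid:
        if reject(alpha):
            return alpha
    return None


def simulate(n_trials=20_000, lam=1.0, seed=0):
    """Estimate E[1{error at chosen alpha} / alpha] for p-values and e-values.

    Assumption: every trial is a true null, so any rejection counts as an error.
    """
    rng = random.Random(seed)
    grid = [i / 100 for i in range(1, 101)]  # candidate levels 0.01, ..., 1.00
    p_total = 0.0
    e_total = 0.0
    for _ in range(n_trials):
        z = rng.gauss(0.0, 1.0)  # test statistic drawn under the null
        p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided p-value, uniform under the null
        e = math.exp(lam * z - 0.5 * lam ** 2)  # e-value with expectation 1 under the null

        # Post-hoc strategy: after seeing the score, pick the smallest level that rejects.
        alpha_p = smallest_rejecting_alpha(lambda a: p <= a, grid)
        alpha_e = smallest_rejecting_alpha(lambda a: e >= 1.0 / a, grid)

        if alpha_p is not None:
            p_total += 1.0 / alpha_p
        if alpha_e is not None:
            e_total += 1.0 / alpha_e
    return p_total / n_trials, e_total / n_trials


if __name__ == "__main__":
    distortion_p, distortion_e = simulate()
    print(f"p-value size distortion estimate: {distortion_p:.2f}")
    print(f"e-value size distortion estimate: {distortion_e:.2f}")
```

Under these assumptions the p-value estimate should land well above 1 while the e-value estimate stays below 1, which is the behaviour the size distortion guarantee describes.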

Paper Structure

This paper contains 37 sections, 1 theorem, 32 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

If the test and the calibration prompts are exchangeable, then our proposed e-scores in \ref{eq:e_score} and \ref{eq:combined_e_score} upper bound the size distortion (marginal over the test and the calibration prompts) by 1, as in \ref{eq:size_distortion}.
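
A generic one-line argument, which is not the paper's proof, suggests why e-values tolerate data-dependent tolerance levels. Assume a single e-value $E \ge 0$ with $\mathbb{E}[E] \le 1$ under the null, and a level $\alpha \in (0, 1]$ that may depend on the data; both symbols are illustrative here rather than the paper's notation. Whenever the indicator below equals 1 we have $1/\alpha \le E$, and otherwise the left-hand side is 0, so the first inequality holds pointwise and the second follows by taking expectations:

$$
\frac{\mathbf{1}\{E \ge 1/\alpha\}}{\alpha} \le E
\quad\Longrightarrow\quad
\mathbb{E}\left[ \frac{\mathbf{1}\{E \ge 1/\alpha\}}{\alpha} \right] \le \mathbb{E}[E] \le 1 .
$$

The theorem's statement additionally covers e-scores built from calibration prompts under exchangeability, which this simplified sketch does not address.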

Figures (3)

  • Figure 1: E-scores example for mathematical factuality. This is an example prompt and response from the ProcessBench benchmark (Zheng et al., 2025) (cf. \ref{subsec:experiments-processbench}). The LLM's response is made up of 5 sub-responses, each a step in the mathematical reasoning (starting from the inner and ending on the outer block). The checks/crosses on the bottom left and the green/red colour of each block represent whether the response is correct/incorrect up to that point (which is not known at test-time). On manual inspection, we bold the part of the third sub-response causing the incorrectness, which cascades to subsequent sub-responses. The e-scores on the bottom right of each block are our proposed measures of incorrectness: low for correct and high for incorrect responses.
  • Figure 2: Scores for mathematical factuality. We use the setting in \ref{subsec:experiments-processbench} to compare our proposed e-scores (in orange) against p-scores (in blue). The left graphs plot size distortion (cf. \ref{eq:size_distortion}). The center graphs plot mean error vs. mean $\alpha$ (where the dashed black line is the identity line). The right graphs plot mean precision vs. mean recall (with the e-scores curves overlapping and hiding part of the p-scores curves).
  • Figure 3: Scores for property constraints satisfaction. We use the setting in \ref{subsec:experiments-ultrafeedback} to compare our proposed e-scores (in orange) against p-scores (in blue). We consider different max-constrained adaptive $\alpha$ strategies, setting $\alpha_{\max} = 0, .01, .02, \ldots, .99, 1$ (cf. \ref{subsec:alpha_strategies}). The left graphs plot size distortion (cf. \ref{eq:size_distortion}). The center graphs plot mean error vs. mean $\alpha$ (where the dashed black line is the identity line). The right graphs plot mean precision vs. mean recall (with the e-scores curves overlapping and hiding part of the p-scores curves).

Theorems & Definitions (2)

  • Theorem 1
  • proof