E-Scores for (In)Correctness Assessment of Generative Model Outputs
Guneet S. Dhillon, Javier González, Teodora Pandeva, Alicia Curth
TL;DR
This work addresses the principled, post-hoc assessment of the (in)correctness of generative model outputs, particularly those of LLMs, by replacing p-value-based conformal methods with e-scores derived from e-values. The framework lets users choose tolerance levels adaptively, after observing the data, while guaranteeing a bound on size distortion: under exchangeability, $\mathbb{E}\left[ \frac{\mathbf{1}\{\exists(\mathbf{Y},\cdot)\in \mathds{S}_{\alpha}(X,g_{\pi}(X)) : o(X,\mathbf{Y})=0\}}{\alpha(X,\mathds{S}(X,g_{\pi}(X)))} \right] \le 1$. E-scores are implemented as the reciprocals of appropriately defined e-values. The method uses a calibrated oracle estimator as a proxy for correctness, combines multiple e-scores for robustness, and is evaluated on two tasks: mathematical factuality and property constraints satisfaction (e.g., instruction-following, truthfulness). Empirically, the size distortion of e-scores stays within the guaranteed bound, and their precision-recall is competitive with that of p-scores, with advantages in memory and time efficiency. Together, the theoretical and experimental results indicate that e-scores give reliable post-hoc control over the (in)correctness of generated outputs across diverse use cases and calibration data settings.
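The reason a reciprocal e-value tolerates post-hoc tolerance levels can be illustrated with a toy sketch. It is not the authors' construction: it assumes a single response per prompt, a hypothetical nonnegative correctness estimate per response, and a standard conformal-style e-value (the test estimate divided by the pooled average over calibration responses known to be incorrect). The names `e_value`, `e_score`, and `distortion_term` are illustrative only.

```python
def e_value(cal_scores, test_score):
    """Toy e-value (an assumption, not the paper's definition).

    cal_scores: nonnegative correctness estimates of calibration responses
                known to be incorrect.
    test_score: the same estimate for the test response.
    If the test response is also incorrect and exchangeable with the
    calibration ones, this quantity has expectation at most 1.
    """
    pooled = list(cal_scores) + [test_score]
    total = sum(pooled)
    if total == 0:
        return 1.0  # all estimates are zero: no evidence either way
    return len(pooled) * test_score / total


def e_score(cal_scores, test_score):
    """E-score as the reciprocal of the e-value (lower = less evidence of error)."""
    e = e_value(cal_scores, test_score)
    return float("inf") if e == 0 else 1.0 / e


def distortion_term(e, alpha):
    """1{e-score <= alpha} / alpha for a single incorrect response."""
    included = e > 0 and (1.0 / e) <= alpha
    return (1.0 / alpha) if included else 0.0


# Hypothetical estimates for five incorrect responses.
scores = [0.2, 0.5, 0.1, 0.9, 0.3]

# Treat each response in turn as the "test" point against the other four.
e_values = [
    e_value(scores[:i] + scores[i + 1:], scores[i]) for i in range(len(scores))
]

# The e-values always sum to the pool size, so under exchangeability
# each one has expectation 1.
print(sum(e_values))

# For ANY alpha, including one picked after seeing the e-score,
# the distortion term never exceeds the e-value itself.
for e in e_values:
    for alpha in (0.01, 0.1, 0.5, 1.0, 1.0 / e):
        assert distortion_term(e, alpha) <= e + 1e-9
```

Because the distortion term is bounded pointwise by the e-value for every tolerance level, taking expectations gives a bound of 1 even when the tolerance level depends on the observed e-score. A p-value threshold offers no such pointwise inequality, which is why choosing the tolerance level post hoc can invalidate its guarantee.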
Abstract
While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a desired user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as a measure of incorrectness. In addition to achieving the same statistical guarantees as before, e-scores provide users flexibility in adaptively choosing tolerance levels after observing the e-scores themselves, by upper bounding a post-hoc notion of error called size distortion. We experimentally demonstrate their efficacy in assessing LLM outputs for different correctness types: mathematical factuality and property constraints satisfaction.
