Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees

Sangwoo Park, Matteo Zecchin, Osvaldo Simeone

TL;DR

The paper tackles the problem of reliably evaluating AI model performance with limited real-world data by addressing biases in autoevaluation. It introduces R-AutoEval+, an adaptive framework that selectively leverages synthetic data through a family of reliance factors and e-value based testing, providing finite-sample reliability guarantees and improved sample efficiency over prior methods. The approach combines adaptive effective observations with PPI++ bias correction, offering theoretical guarantees on sample complexity and practical validation across LLM quantization, prompting, and reasoning-budget tasks. This work enables scalable, reliable model evaluation in real-world settings where unlabeled data and autoevaluators are available but real labels are costly.

Abstract

Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference (PPI) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, for prompt design in LLMs, and for test-time reasoning budget allocation in LLMs confirm the reliability and efficiency of R-AutoEval+.
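
The bias correction that PPI applies to autoevaluator outputs can be illustrated with a minimal sketch. It assumes the standard PPI++-style mean estimator, in which a reliance factor `rho` scales a correction term built from the autoevaluator's losses on labeled and unlabeled inputs. The function name `ppi_risk_estimate` and its arguments are hypothetical and not taken from the paper; setting `rho = 0` falls back to the conventional estimate that uses real-world labels only.

```python
from statistics import fmean


def ppi_risk_estimate(real_losses, auto_losses_labeled, auto_losses_unlabeled, rho):
    """PPI++-style risk estimate with a reliance factor rho in [0, 1].

    real_losses           : losses measured with real-world labels (n values)
    auto_losses_labeled   : autoevaluator losses on the same n inputs
    auto_losses_unlabeled : autoevaluator losses on N extra unlabeled inputs
    rho                   : how strongly to rely on synthetic data (assumed in [0, 1])
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    # Difference between the autoevaluator's average loss on unlabeled data
    # and on labeled data; scaled by rho, it is added to the real-data average.
    correction = fmean(auto_losses_unlabeled) - fmean(auto_losses_labeled)
    return fmean(real_losses) + rho * correction


# Toy data (made up for illustration only).
real = [1, 0, 0, 1]
auto_labeled = [1, 0, 1, 1]
auto_unlabeled = [0, 1, 0, 0, 1, 0, 0, 1]

conventional = ppi_risk_estimate(real, auto_labeled, auto_unlabeled, rho=0.0)
full_ppi = ppi_risk_estimate(real, auto_labeled, auto_unlabeled, rho=1.0)
```

For any fixed `rho`, the correction term has zero mean when labeled and unlabeled inputs follow the same distribution, so the estimate stays unbiased while its variance depends on how well the autoevaluator tracks the real losses.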

Paper Structure

This paper contains 40 sections, 6 theorems, 48 equations, 10 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Under mild regularity assumptions, and for a sufficiently low tolerated unreliability level $\delta$, R-AutoEval+ is provably at least as sample efficient as both R-Eval [waudby2024estimating] and R-AutoEval [einbinder2024semi], i.e., its sample complexity is no larger than that of either baseline. Furthermore, this inequality is strict when the autoevaluator is sufficiently accurate.

Figures (10)

  • Figure 1: How can one select the lightest quantized Llama-3.1-8B-Instruct model [grattafiori2024llama] (in the MX quantization format [rouhani2023microscaling]) that guarantees at most a $10\%$ performance drop compared to the unquantized version (BF16) on the TriviaQA task [joshi2017triviaqa]? (left) Ground-truth risk $R$ for different MX quantization settings, which requires massive human-labeled data to compute. (right) Performance drop and corresponding model size for the models chosen via Eval, AutoEval [novikova2017we, liu2016not, zheng2023judging], R-Eval [waudby2024estimating], R-AutoEval [einbinder2024semi], and the proposed R-AutoEval+. We adopt Llama-3.3-70B-Instruct [grattafiori2024llama] BF16/MX6/MX4 as the autoevaluators, and set the target risk in (\ref{eq:goal}) to $\alpha=0.1$ and the target reliability in (\ref{eq:type_1}) to $1-\delta=0.9$. Maximum values are reported within $1.5$ times the interquartile range (IQR) [mcgill1978variations] across $500$ independent experiments (see Sec. \ref{sec:experiments} for details).
  • Figure 2: Heatmap of the evolution of the weights $\{w_{s,i}\}_{s=1}^{100}$ assigned to the factors $\rho_s$ as a function of the processing round $i$ for Example \ref{example}. The autoevaluator reports the correct loss with probability $\gamma=0.99$ (top), $\gamma=0.9$ (middle), and $\gamma=0.7$ (bottom). R-AutoEval+ assigns larger weights to synthetic data, i.e., to larger values of $\rho_s$, when the autoevaluator is of higher quality.
  • Figure 3: Sample complexity as a function of $\log(1/\delta)$ (top) and maximum expected increment of the log-e-value $g_{s,\star}$ as a function of $\rho_s$ (bottom) for Example \ref{example_2}.
  • Figure 4: Risk-controlling model selection using R-Eval [waudby2024estimating, bates2021distribution, angelopoulos2021learn], R-AutoEval [einbinder2024semi], and the proposed R-AutoEval+ for the problems of (a) selecting the lightest quantized LLM with guaranteed performance drop on the TriviaQA data set [joshi2017triviaqa], and (b) selecting the shortest prompt template with guaranteed execution accuracy on the Instruction-Induction task [honovich2022instruction]. (a, top) Size of the smallest selected model versus the number $n$ of real-world data points, and (a, bottom) corresponding performance drop. (b, top) Length of the shortest selected prompt template versus the number of in-context samples used by the autoevaluator, and (b, bottom) corresponding complement of the execution accuracy.
  • Figure 5: Same setting as in Figs. \ref{fig:portfolio} and \ref{fig:toy_theory}, but with WSR betting instead of UP betting.
  • ...and 5 more figures
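
The weight evolution described in the caption of Figure 2 can be mimicked with a small sketch. It assumes a multiplicative-weights (portfolio-style) update in which each reliance factor's weight is scaled by that factor's per-round e-value increment and then renormalized; this is one common choice and not necessarily the paper's exact rule. The names `rebalance` and `mixture_wealth`, and the toy increments, are hypothetical.

```python
def rebalance(weights, increments):
    """One round: return updated weights and the mixture's growth factor.

    weights    : current weights over the reliance factors (sum to 1)
    increments : positive per-factor multiplicative e-value increments
    """
    # Growth of the mixture this round, computed with the current weights.
    growth = sum(w * g for w, g in zip(weights, increments))
    # Factors with larger increments receive proportionally more weight.
    new_weights = [w * g / growth for w, g in zip(weights, increments)]
    return new_weights, growth


def mixture_wealth(increments_per_round, num_factors):
    """Run the update from uniform weights; return final wealth and weight history."""
    weights = [1.0 / num_factors] * num_factors
    wealth = 1.0
    history = []
    for increments in increments_per_round:
        weights, growth = rebalance(weights, increments)
        wealth *= growth
        history.append(list(weights))
    return wealth, history


# Toy increments for two factors over three rounds (made up for illustration):
# the first factor grows consistently, the second does not.
rounds = [[1.2, 0.9], [1.1, 1.0], [1.3, 0.8]]
wealth, history = mixture_wealth(rounds, num_factors=2)
final_weights = history[-1]
```

Under this update the final wealth equals the initial-weight average of each factor's individual product of increments, so the mixture automatically tracks whichever factors accumulate evidence fastest.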

Theorems & Definitions (11)

  • Theorem 1: Informal
  • Example 1
  • Theorem 2: Sample complexity of testing-by-betting (\ref{eq:basic_E_n}) [waudby2025universal]
  • Lemma 1: Sublinear regret for the weights (\ref{eq:rebalance})
  • Theorem 3: Sample complexity of R-AutoEval+
  • proof
  • Example 2
  • proof
  • Lemma 2: R-Eval's suboptimality ratio [waudby2025universal]
  • Lemma 3: $E_n^{\texttt{+}}$'s suboptimality ratio
  • ...and 1 more
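
The testing-by-betting procedure named in Theorem 2 can be sketched in its simplest form. The sketch assumes losses in $[0, 1]$, a target risk `alpha`, and a fixed bet size, whereas Figure 5 indicates that the paper uses adaptive betting strategies (UP and WSR); the fixed bet is a simplification, and the function names `betting_e_value` and `certify` are hypothetical. The e-value is a product of per-sample factors whose expectation is at most one whenever the true risk exceeds `alpha`, so by Markov's inequality it reaches $1/\delta$ with probability at most $\delta$ in that case.

```python
def betting_e_value(losses, alpha, bet):
    """E-value prod_i (1 - bet * (loss_i - alpha)) for losses in [0, 1].

    The constraint 0 <= bet < 1 / (1 - alpha) keeps every factor positive.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 <= bet < 1.0 / (1.0 - alpha):
        raise ValueError("bet must lie in [0, 1 / (1 - alpha))")
    e_value = 1.0
    for loss in losses:
        # Wealth grows when the observed loss is below the target risk alpha.
        e_value *= 1.0 - bet * (loss - alpha)
    return e_value


def certify(losses, alpha, delta, bet):
    """Declare 'risk <= alpha' when the e-value reaches the threshold 1 / delta."""
    return betting_e_value(losses, alpha, bet) >= 1.0 / delta


# Toy run (made-up data): with zero observed losses, alpha = 0.1 and bet = 1,
# each sample multiplies the e-value by 1.1.
enough_samples = certify([0.0] * 25, alpha=0.1, delta=0.1, bet=1.0)
too_few_samples = certify([0.0] * 24, alpha=0.1, delta=0.1, bet=1.0)
```

The number of samples needed to cross the threshold is the sample complexity that the theorems above compare across methods; a larger expected per-sample growth means fewer real-world data points are required.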