Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees
Sangwoo Park, Matteo Zecchin, Osvaldo Simeone
TL;DR
The paper tackles the problem of reliably evaluating AI model performance with limited real-world data by addressing the bias of autoevaluation. It introduces R-AutoEval+, an adaptive framework that selectively leverages synthetic data through a family of reliance factors and e-value-based testing, providing finite-sample reliability guarantees and sample efficiency that is improved over, or at least no worse than, conventional methods that use only real-world data. The approach combines adaptive effective observations with PPI++ bias correction; it comes with theoretical guarantees on sample complexity and is validated empirically on LLM quantization, prompting, and reasoning-budget tasks. This work enables scalable, reliable model evaluation in real-world settings where unlabeled data and autoevaluators are available but real labels are costly.
Abstract
Selecting artificial intelligence (AI) models, such as large language models (LLMs), from multiple candidates requires accurate performance estimation. This is ideally achieved through empirical evaluations involving abundant real-world data. However, such evaluations are costly and impractical at scale. To address this challenge, autoevaluation methods leverage synthetic data produced by automated evaluators, such as LLMs-as-judges, reducing variance but potentially introducing bias. Recent approaches have employed semi-supervised prediction-powered inference (PPI) to correct for the bias of autoevaluators. However, the use of autoevaluators may lead in practice to a degradation in sample efficiency compared to conventional methods using only real-world data. In this paper, we propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation, while also ensuring an enhanced (or at least no worse) sample efficiency compared to conventional methods. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data, reverting to conventional methods when the autoevaluator is insufficiently accurate. Experiments on the use of LLMs-as-judges for the optimization of quantization settings for the weights of an LLM, for prompt design in LLMs, and for test-time reasoning budget allocation in LLMs confirm the reliability and efficiency of R-AutoEval+.
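To make the role of a reliance factor concrete, the following minimal sketch shows a PPI++-style mean estimate in which a scalar weight controls how much the autoevaluator's synthetic labels contribute. It illustrates only the bias-correction idea, not the e-value-based testing procedure of R-AutoEval+. The function name `ppi_mean`, its arguments, and the toy data are assumptions made for illustration.

```python
from statistics import fmean


def ppi_mean(real_labels, judge_on_labeled, judge_on_unlabeled, reliance):
    """PPI++-style estimate of a mean score (illustrative sketch only).

    real_labels        : real-world scores on the small labeled set
    judge_on_labeled   : autoevaluator scores on the same labeled inputs
    judge_on_unlabeled : autoevaluator scores on the large unlabeled set
    reliance           : weight on the synthetic data; 0 recovers the
                         conventional sample mean of the real labels
    """
    # Difference between the judge's mean on unlabeled and on labeled inputs.
    # Any systematic judge bias appears in both terms and cancels on average.
    correction = fmean(judge_on_unlabeled) - fmean(judge_on_labeled)
    return fmean(real_labels) + reliance * correction


# Toy data (hypothetical): a judge that is overly optimistic on labeled inputs.
real = [1, 0, 1, 1]
judge_labeled = [1, 1, 1, 1]
judge_unlabeled = [1] * 9 + [0]

conventional = ppi_mean(real, judge_labeled, judge_unlabeled, reliance=0.0)
full_reliance = ppi_mean(real, judge_labeled, judge_unlabeled, reliance=1.0)
```

Setting `reliance` to zero discards the synthetic data entirely, which is the sense in which an adaptive scheme can revert to the conventional estimate when the autoevaluator is not accurate enough.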
