AutoEval Done Right: Using Synthetic Data for Model Evaluation

Pierre Boyeau; Anastasios N. Angelopoulos; Nir Yosef; Jitendra Malik; Michael I. Jordan

TL;DR

AutoEval introduces a principled framework for evaluating machine learning systems with few human labels by leveraging AI-generated synthetic labels on a large unlabeled corpus. It uses prediction-powered inference (PPI and PPI++) to debias synthetic data and reduce estimator variance, delivering unbiased estimates and confidence intervals. The approach yields substantial gains in effective sample size (up to ~50%) across tasks such as ImageNet accuracy, protein fitness prediction, and LLM pairwise rankings, and it provides practical tools for both metric estimation and pairwise comparison evaluation. By enabling scalable, low-cost, and statistically valid evaluation, AutoEval offers a versatile alternative to exhaustive human annotation while acknowledging limitations related to distribution shifts and annotator bias.

Abstract

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
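
The debiasing idea described above can be illustrated for the simplest target, a mean such as model accuracy. The following is a minimal sketch, not the code shown in Figure 2 of the paper: the function names `ppi_mean` and `effective_sample_size` are hypothetical, the weight `lam` follows the usual PPI++-style power tuning for a mean, and the synthetic correctness scores are assumed to be any real-valued proxy for the human labels (for example, a model confidence).

```python
import numpy as np
from statistics import NormalDist


def ppi_mean(y_lab, yhat_lab, yhat_unlab, alpha=0.1, lam=None):
    """Prediction-powered estimate of E[Y] with a (1 - alpha) confidence interval.

    y_lab:      human labels on the small labeled set (e.g. 1 if the model is correct)
    yhat_lab:   synthetic labels on the same labeled points
    yhat_unlab: synthetic labels on the large unlabeled set
    lam:        weight on the synthetic data; 0 is the classical estimate,
                1 is plain PPI, None tunes it from the data (PPI++-style)
    """
    y_lab = np.asarray(y_lab, dtype=float)
    yhat_lab = np.asarray(yhat_lab, dtype=float)
    yhat_unlab = np.asarray(yhat_unlab, dtype=float)
    n, N = len(y_lab), len(yhat_unlab)

    if lam is None:
        # Variance-minimizing weight for a mean: Cov(Y, Yhat) / ((1 + n/N) Var(Yhat)).
        var_hat = np.var(np.concatenate([yhat_lab, yhat_unlab]), ddof=1)
        cov = np.cov(y_lab, yhat_lab, ddof=1)[0, 1]
        lam = cov / ((1 + n / N) * var_hat) if var_hat > 0 else 0.0

    # Synthetic-data mean plus a correction measured on the labeled points.
    theta = y_lab.mean() + lam * (yhat_unlab.mean() - yhat_lab.mean())

    # Variance of the estimator: labeled-set term plus unlabeled-set term.
    var = (np.var(y_lab - lam * yhat_lab, ddof=1) / n
           + lam ** 2 * np.var(yhat_unlab, ddof=1) / N)

    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * np.sqrt(var)
    return theta, (theta - half_width, theta + half_width), var


def effective_sample_size(n, var_classical, var_ppi):
    """Number of human labels the classical estimator would need to match var_ppi."""
    return n * var_classical / var_ppi
```

Calling `ppi_mean(..., lam=0.0)` recovers the classical sample mean and its variance, which gives the baseline for `effective_sample_size`. The better the synthetic labels track the human labels, the smaller the tuned variance and the larger the effective sample size.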

Paper Structure (25 sections, 16 equations, 9 figures, 1 table)

This paper contains 25 sections, 16 equations, 9 figures, 1 table.

Introduction
Related Work
Autoevaluating Accuracy and other Metrics
Defining the Goal
The Algorithm
Application to Rank Computer Vision Models
Application to Evaluate Protein Fitness Prediction Models
Evaluating Model Performance from Pairwise Comparisons
A Model to Assess Relative Performance
Autoevaluation of Relative Performance
Autoevaluation of LLMs from Pairwise Preferences
Discussion
Limitations
Broader Impacts
Experimental details
...and 10 more sections

Figures (9)

Figure 1: Effective sample sizes (ESS) of our approach vs. a classical test to infer the average win rate of gpt-3.5-turbo against other LLMs in the Chatbot Arena (Chatzi et al., 2024).
Figure 2: Python code to produce CIs and point estimates for model accuracy. The variable meanings are explained in the code comments.
Figure 3: ImageNet experiment. For every approach, we built confidence intervals around the average accuracy of different ResNet architectures. a. MSE of the point estimates of the model accuracies. b. ESS of PPI and PPI++ against the classical approach. c. Correlation between the estimated and true model rankings. Here, and in all following figures, obtained metrics are averaged across 250 random splits of the validation data into labeled and unlabeled data.
Figure 4: Protein fitness experiment for building confidence intervals and point estimates for the Pearson correlation of seven protein language models with the experimental fitness scores, using a held-out model to produce synthetic labels. a. MSE of the point estimates of the model correlations. b. ESS of PPI and PPI++ against the classical approach. c. Correlation between the estimated and true model rankings.
Figure 5: ESS of PPI++ against annotator model performance for $n=500$ labeled points. The horizontal line denotes the ESS of the classical approach.
...and 4 more figures
