AutoBencher: Towards Declarative Benchmark Construction

Xiang Lisa Li; Farzaan Kaiyom; Evan Zheran Liu; Yifan Mai; Percy Liang; Tatsunori Hashimoto

TL;DR

AutoBencher addresses language model evaluation by automatically constructing evaluation datasets through a declarative optimization framework. It defines capability and safety desiderata, then uses a GPT-4-based evaluator to propose dataset descriptions and to generate questions grounded in privileged information, which candidate LMs then answer. The resulting datasets are more novel, difficult, and separable than human-crafted benchmarks and reveal new model weaknesses and safety vulnerabilities. In experiments across math, knowledge, multilinguality, history, economics, science, and safety, AutoBencher elicits about $22\%$ more model errors and shows about $27\%$ lower ranking correlation with existing benchmarks, while its safety datasets raise attack success rate by about $20\%$. This scalable benchmark construction could accelerate evaluation and discovery, though it also calls for careful human validation and for broader desiderata such as diversity and coverage.
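To make "ranking correlation with existing benchmarks" concrete, the following is a minimal sketch that computes a Spearman rank correlation between per-model accuracies on two datasets. It assumes Spearman correlation is the statistic of interest; this summary does not specify the exact definition, and the accuracy values and function names are made up for illustration.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    # Raises ZeroDivisionError if either input is constant.
    return cov / (var_x * var_y) ** 0.5


# Made-up accuracies of four hypothetical models (not results from the paper).
existing_acc = [0.9, 0.8, 0.7, 0.6]
new_acc = [0.5, 0.6, 0.3, 0.4]
rho = spearman(existing_acc, new_acc)
```

A correlation near 1 means the new dataset ranks models the same way the existing one does; lower values indicate that the new dataset induces a different ranking.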

Abstract

We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty side, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and Fordism, while GPT-4o fails to decline harmful requests about cryptocurrency scams.
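The propose, score, and select loop described in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `adaptive_search`, `toy_propose`, and `toy_score` are hypothetical names, the objective is assumed to be a plain sum of difficulty and novelty, the salience desideratum is ignored, and the toy scores are invented. In the actual system a language model would do the proposing, and scoring would require generating a dataset and evaluating candidate LMs on it.

```python
from typing import Callable, Dict, List, Tuple

Scored = Tuple[str, float]  # (dataset description, objective value)


def adaptive_search(
    propose: Callable[[List[Scored], int], List[str]],
    score: Callable[[str], float],
    rounds: int,
    per_round: int,
) -> Scored:
    """Iteratively propose descriptions, score them, and return the best one."""
    history: List[Scored] = []
    for _ in range(rounds):
        # The proposer sees all previously scored descriptions, so it can adapt.
        for description in propose(history, per_round):
            history.append((description, score(description)))
    # Raises ValueError if nothing was ever proposed.
    return max(history, key=lambda item: item[1])


# Invented (difficulty, novelty) pairs for placeholder topics.
TOY_SCORES: Dict[str, Tuple[float, float]] = {
    "topic A": (0.2, 0.1),
    "topic B": (0.6, 0.3),
    "topic C": (0.4, 0.7),
    "topic D": (0.5, 0.1),
}


def toy_propose(history: List[Scored], k: int) -> List[str]:
    """Stand-in for an LM proposer: return up to k untried topics."""
    tried = {description for description, _ in history}
    return [t for t in TOY_SCORES if t not in tried][:k]


def toy_score(description: str) -> float:
    """Assumed objective: difficulty plus novelty, equally weighted."""
    difficulty, novelty = TOY_SCORES[description]
    return difficulty + novelty


best = adaptive_search(toy_propose, toy_score, rounds=2, per_round=2)
```

Passing the scored history back to the proposer is what makes the search adaptive: later proposals can be conditioned on which earlier descriptions scored well.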

Paper Structure (39 sections, 4 equations, 9 figures, 13 tables, 1 algorithm)

Introduction
Related Work
A Declarative Framework of Benchmark Creation
Capability Evaluation
Safety Evaluation
Solving the Optimization Problem
Generating Datasets with Privileged Information
Proposing Topics with Adaptive Search
Experimental Setup
Baselines and Metrics
AutoBencher Hyperparameters and Costs
Main Results
Capability Settings: Novelty, Difficulty, Separability
The Safety Setting: Attack Success Rate
Qualitative Examples
...and 24 more sections

Figures (9)

Figure 1: (Left) A toy example of model rankings on existing datasets and AutoBencher datasets. Existing datasets show roughly the same performance trends, while AutoBencher discovers tests that induce novel rankings. (Right) Given a domain (e.g., history), AutoBencher creates datasets that are salient, difficult, and novel. It achieves this by searching over dataset descriptions (e.g., the timeline of WWII), scoring each based on difficulty and novelty, and selecting the best one.
Figure 2: How the model $\texttt{LM}_\text{evaluator}$ uses privileged information to create (question, answer) examples.
Figure 3: The standard deviation of the three metrics: novelty, separability and difficulty as a function of dataset size.
Figure 4: Annotation guideline for salience judgment on Amazon Mechanical Turk.
Figure 5: Search trajectories of AutoBencher (history) with different $\texttt{LM}_\text{candidate}$. It shows the evaluation topics that are explored and their respective accuracy as a star plot.
...and 4 more figures
