AutoBencher: Towards Declarative Benchmark Construction
Xiang Lisa Li, Farzaan Kaiyom, Evan Zheran Liu, Yifan Mai, Percy Liang, Tatsunori Hashimoto
TL;DR
AutoBencher addresses the challenge of evaluating language models by constructing evaluation datasets automatically within a declarative optimization framework. It defines desiderata for capability and safety evaluation, then uses a GPT-4-based evaluator to propose dataset descriptions grounded in privileged information and to generate questions from them, which candidate LMs then answer. The resulting datasets are more novel, difficult, and separable than human-crafted benchmarks, and they reveal new model weaknesses and safety vulnerabilities. In experiments across math, knowledge, multilinguality, history, economics, science, and safety, AutoBencher datasets elicit about 22% more model errors and show about 27% lower ranking correlation with existing benchmarks, while its safety datasets increase attack success rate by about 20%. This scalable approach to benchmark construction could accelerate evaluation and the discovery of weaknesses and risks, while motivating careful human validation and broader desiderata such as diversity and coverage.
Abstract
We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata for benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., are more difficult) than existing benchmarks. On the novelty side, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on the Permian Extinction and Fordism, while GPT-4o fails to decline harmful requests about cryptocurrency scams.
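To make the idea of casting benchmark creation as an optimization problem concrete, the following is a minimal sketch of how candidate dataset descriptions might be scored and selected. It is not the authors' implementation: the metric definitions (difficulty as mean error rate, novelty as one minus the Spearman rank correlation with accuracies on an existing benchmark), the weight `w_novel`, and all function names are illustrative assumptions, and the step where a language model proposes descriptions and generates questions is omitted.

```python
from statistics import mean


def ranks(values):
    """Return 0-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation between two equal-length lists (no ties)."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def difficulty(acc):
    """Assumed proxy for difficulty: mean error rate across candidate LMs."""
    return 1 - mean(acc)


def novelty(acc, ref_acc):
    """Assumed proxy for novelty: 1 - rank correlation with an existing benchmark."""
    return 1 - spearman(acc, ref_acc)


def score(acc, ref_acc, w_novel=1.0):
    """Hypothetical combined objective; w_novel is an illustrative weight."""
    return difficulty(acc) + w_novel * novelty(acc, ref_acc)


def select_best(candidates, ref_acc):
    """Pick the dataset description whose per-model accuracies score highest."""
    return max(candidates, key=lambda name: score(candidates[name], ref_acc))


# Per-model accuracies on an existing benchmark (made-up numbers).
ref_acc = [0.9, 0.8, 0.7]
# Per-model accuracies on two hypothetical candidate datasets.
candidates = {
    "topic_A": [0.85, 0.75, 0.65],  # easy, same model ranking as the reference
    "topic_B": [0.40, 0.60, 0.50],  # harder, different model ranking
}
best = select_best(candidates, ref_acc)
```

In this toy example, `topic_B` is selected because it both lowers accuracy and reorders the models relative to the reference benchmark. In an iterative setting, scores like these could be fed back to the proposing language model so that later proposals improve on the declared desiderata.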
