Automating Benchmark Design
Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma
TL;DR
BeTaL (Benchmark Tuning with an LLM-in-the-loop) addresses the evaluation bottleneck created by rapidly advancing LLMs. It is an automated framework that parameterizes base benchmarks and uses iterative, LLM-driven design, guided by feedback from real model evaluations, to reach a target difficulty. BeTaL formalizes benchmark design as an optimization problem over the parameter space and solves it with a closed feedback loop: parameter generation, environment instantiation, evaluation, and refinement. Across arithmetic, spatial reasoning, and agentic benchmark domains, it achieves average deviations from target difficulty of 5.3%–13.2%, a 2–4x improvement over strong baselines. BeTaL-designed benchmarks transfer across evaluation models, and the results indicate that LLMs must be used iteratively; single-round or chain-of-thought prompting alone is not enough. The work also reveals limitations of AI-generated parameter spaces and outlines future work on multi-objective optimization and human-in-the-loop enhancements to broaden applicability and reliability.
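The closed feedback loop described above (parameter generation, environment instantiation, evaluation, refinement) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `tune_benchmark`, `toy_propose`, `toy_evaluate`, and `num_operands` are hypothetical, the proposer is a simple step heuristic standing in for the LLM designer, and the accuracy formula is invented purely so the example runs without a real model.

```python
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, int]
History = List[Tuple[Params, float]]


def tune_benchmark(
    propose: Callable[[History, float], Params],
    evaluate: Callable[[Params], float],
    target: float,
    rounds: int = 10,
) -> Tuple[Optional[Params], float]:
    """Run propose -> instantiate/evaluate -> refine for a fixed number of rounds.

    Returns the parameters whose measured accuracy was closest to the target,
    together with that absolute gap. (Keeping the best round is an assumption.)
    """
    history: History = []
    best_params: Optional[Params] = None
    best_gap = float("inf")
    for _ in range(rounds):
        params = propose(history, target)    # parameter generation
        accuracy = evaluate(params)          # instantiation + model evaluation
        history.append((params, accuracy))   # feedback for the next proposal
        gap = abs(accuracy - target)
        if gap < best_gap:
            best_params, best_gap = params, gap
    return best_params, best_gap


def toy_propose(history: History, target: float) -> Params:
    """Stand-in for the LLM designer: nudge difficulty by one step per round."""
    if not history:
        return {"num_operands": 1}
    last_params, last_accuracy = history[-1]
    step = 1 if last_accuracy > target else -1  # too easy -> make it harder
    return {"num_operands": max(1, last_params["num_operands"] + step)}


def toy_evaluate(params: Params) -> float:
    """Invented difficulty curve: accuracy drops by 0.1 per extra operand."""
    return max(0.0, (10 - params["num_operands"]) / 10)


best, gap = tune_benchmark(toy_propose, toy_evaluate, target=0.5, rounds=10)
print(best, gap)
```

In a real setting, `propose` would prompt an LLM with the evaluation history and `evaluate` would instantiate the benchmark and run the model under test; the loop structure is the only part this sketch is meant to convey.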
Abstract
The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but they quickly become saturated. Dynamic benchmarks, in contrast, evolve alongside the models they evaluate, but they are expensive to create and to update continuously. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate dynamic benchmark design. BeTaL parameterizes key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend the popular agentic benchmark $\tau$-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2%, a 2-4x improvement over the baselines.
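One way to read "creating a benchmark with a desired difficulty level" is as a search over the template's parameter space for the setting whose measured performance is closest to the target. The notation below is an assumption made for illustration, not the paper's own: $\Theta$ is the parameter space of a base template, $B(\theta)$ the benchmark instantiated with parameters $\theta$, $M$ the model being evaluated, $\mathrm{acc}$ its measured performance, and $d^{\star}$ the target difficulty level.

```latex
\[
\theta^{\star} \in \arg\min_{\theta \in \Theta}
\bigl|\, \mathrm{acc}\bigl(M, B(\theta)\bigr) - d^{\star} \,\bigr|
\]
```

Under this reading, the reported deviations of 5.3% to 13.2% would correspond to the value of the absolute gap, averaged over tasks and target levels.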
