Automating Benchmark Design

Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma

TL;DR

BeTaL tackles the evaluation bottleneck for rapidly advancing LLMs by introducing Benchmark Tuning with an LLM-in-the-loop, an automated framework that parameterizes base benchmarks and uses iterative LLM-driven design guided by real model feedback to reach target difficulty. By formalizing the design space as an optimization problem and employing a closed feedback loop (parameter generation, environment instantiation, evaluation, and refinement), BeTaL achieves significantly lower deviations from target difficulty (5.3%–13.2%) and outperforms strong baselines by 2–4x across arithmetic, spatial reasoning, and agentic benchmark domains. The approach demonstrates transferability of BeTaL-designed benchmarks across evaluation models and highlights that LLMs must be used iteratively rather than relying on single-round prompts or chain-of-thought prompts alone. It also reveals limitations in AI-generated parameter spaces and outlines future work on multi-objective optimization and human-in-the-loop enhancements to broaden applicability and reliability.

Abstract

The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL parameterizes key design choices in base benchmark templates and uses LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using BeTaL, we create two new benchmarks and extend a popular agentic benchmark, $\tau$-bench. Extensive evaluation on these three tasks and multiple target difficulty levels shows that BeTaL produces benchmarks much closer to the desired difficulty, with average deviations ranging from 5.3% to 13.2%, a 2-4x improvement over the baselines.
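
The closed loop summarized above (parameter generation, environment instantiation, evaluation, refinement) can be illustrated with a toy sketch. This is not the authors' implementation: the single integer knob, the simulated accuracy curve, and every function name below are hypothetical, and a simple bisection heuristic stands in for the LLM designer, which in BeTaL would reason over the feedback history instead.

```python
from typing import Callable, List, Optional, Tuple

# Feedback history: (parameter value tried, accuracy observed on the benchmark).
History = List[Tuple[int, float]]


def simulated_accuracy(knob: int) -> float:
    """Hypothetical stand-in for 'instantiate the benchmark and evaluate the
    target model'. Accuracy falls monotonically as the knob grows."""
    return 1.0 / (1.0 + knob / 5.0)


def heuristic_designer(target: float, history: History,
                       lo: int, hi: int) -> Optional[int]:
    """Hypothetical stand-in for the LLM designer: narrow the search range
    using past feedback, then propose the midpoint (plain bisection)."""
    for knob, acc in history:
        if acc > target:      # benchmark too easy: only larger knobs remain
            lo = max(lo, knob + 1)
        elif acc < target:    # benchmark too hard: only smaller knobs remain
            hi = min(hi, knob - 1)
        else:
            return knob       # exact hit seen before
    if lo > hi:
        return None           # search range exhausted
    return (lo + hi) // 2


def tune(target: float,
         evaluate: Callable[[int], float],
         designer: Callable[[float, History, int, int], Optional[int]],
         lo: int = 1, hi: int = 40,
         max_iters: int = 10, tol: float = 0.02):
    """Run the propose -> instantiate/evaluate -> refine loop and return the
    best (gap, knob, accuracy) triple together with the full history."""
    history: History = []
    best = None
    for _ in range(max_iters):
        knob = designer(target, history, lo, hi)   # parameter generation
        if knob is None:
            break
        acc = evaluate(knob)                       # instantiate + evaluate
        history.append((knob, acc))                # feedback for refinement
        gap = abs(acc - target)
        if best is None or gap < best[0]:
            best = (gap, knob, acc)
        if gap <= tol:
            break
    return best, history


if __name__ == "__main__":
    for target in (0.25, 0.5, 0.75):
        best, history = tune(target, simulated_accuracy, heuristic_designer)
        gap, knob, acc = best
        print(f"target={target:.2f} knob={knob} acc={acc:.3f} "
              f"gap={gap:.3f} rounds={len(history)}")
```

Here the gap is taken to be the absolute difference between observed and target accuracy, which is an assumption about how deviation from target difficulty is measured. Returning the best configuration seen, rather than the last one, keeps the result usable even when the loop stops before reaching the tolerance.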

Paper Structure

This paper contains 24 sections, 2 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: BeTaL automates the process of designing and adjusting dynamic benchmarks to meet target criteria.
  • Figure 2: Evaluation results on o4-mini. BeTaL (with GPT-5 as the designer model and o4-mini as the target model during parameter search) performs robustly at different target difficulty levels compared to baselines on Arithmetic Sequences, Spatial Reasoning, and $\tau$-Bench. Similar performance is observed with Claude Opus 4.1 and Grok-4 as designers; see Figure \ref{fig:avg_gap_merged} in the Appendix.
  • Figure 3: Convergence of iterative methods during parameter selection on the Spatial Reasoning and $\tau$-Bench benchmarks: BeTaL vs. RS+PPR. Within 10 iterations, the performance gap of BeTaL shrinks faster than that of RS+PPR, indicating that LLMs are more efficient than competing iterative methods at finding favorable environment parameters for benchmark creation. Results are averaged over difficulty levels and designer models.
  • Figure 4: Evaluation generalization across designer models and datasets. Observed versus target accuracy for the o4-mini target model with different designers (columns: GPT-5, Grok-4, Opus-4.1) on three benchmarks (rows: Arithmetic Sequences, Spatial Reasoning, $\tau$-Bench). The black dashed line indicates perfect alignment.
  • Figure 5: Results on different evaluation models. The left figure shows aggregate results for all methods, and the right figure focuses on BeTaL's results, showing the observed accuracies at different target difficulty levels. All results are averaged across Designer Models.
  • ...and 10 more figures