BenchAgents: Multi-Agent Systems for Structured Benchmark Creation
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran
TL;DR
BenchAgents addresses the benchmark bottleneck in evaluating evolving generative capabilities by introducing a four-agent, LLM-driven pipeline (Planning, Data Generation, Verification, Evaluation) that automates high-quality benchmark creation with developer oversight. It demonstrates three benchmarks (BA-Calendar, BA-Text, and BA-Causal) spanning planning, constraint satisfaction, and visual causal reasoning, and provides fine-grained analyses of state-of-the-art models. The work shows that reasoning-enabled models excel on planning and constrained generation, while negation and visual perception remain challenging. It also establishes a scalable, controllable framework for open-ended evaluation that retains strong quality assurance. Overall, BenchAgents reduces manual effort and bias while enabling systematic diagnostics of model strengths and failure modes across language and vision tasks.
Abstract
Evaluation insights are limited by the availability of high-quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi-agent framework that methodically leverages large language models (LLMs) to automate evaluation benchmark creation while inherently ensuring data and (evaluation) metric quality. BenchAgents decomposes the benchmark creation process into planning, generation, verification, and evaluation, each of which is orchestrated via LLM agents. These agents interact with each other and utilize feedback from benchmark developers to improve and flexibly control data diversity and quality. We use BenchAgents to create benchmarks to evaluate capabilities related to planning, constraint satisfaction, and causal reasoning spanning both language and vision modalities. We then use these benchmarks to study state-of-the-art models and extract new insights into common failure modes and model differences.
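The four-stage decomposition described above (planning, generation, verification, evaluation) can be pictured with a minimal sketch. This is an illustration only, not the authors' implementation: every name here (Example, plan, generate, verify, evaluate, build_benchmark) is hypothetical, each agent is replaced by a plain Python function instead of an LLM call, and the developer-supplied check callable stands in for developer feedback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Example:
    """One benchmark item (hypothetical structure, not from the paper)."""
    prompt: str
    constraints: List[str] = field(default_factory=list)
    verified: bool = False


def plan(task: str) -> List[str]:
    # Planning stub: decide which constraints each example should carry.
    return [f"{task}: constraint {i}" for i in range(1, 3)]


def generate(task: str, constraints: List[str], n: int) -> List[Example]:
    # Generation stub: produce n candidate examples following the plan.
    return [Example(f"{task} instance {k}", list(constraints)) for k in range(n)]


def verify(examples: List[Example], check: Callable[[Example], bool]) -> List[Example]:
    # Verification stub: keep only the examples that pass the supplied check.
    kept = []
    for ex in examples:
        if check(ex):
            ex.verified = True
            kept.append(ex)
    return kept


def evaluate(examples: List[Example],
             model: Callable[[str], str],
             metric: Callable[[Example, str], float]) -> float:
    # Evaluation stub: average a per-example metric over the model's outputs.
    scores = [metric(ex, model(ex.prompt)) for ex in examples]
    return sum(scores) / len(scores) if scores else 0.0


def build_benchmark(task: str, n: int, check: Callable[[Example], bool]) -> List[Example]:
    # Chain the first three stages: plan, then generate, then verify.
    return verify(generate(task, plan(task), n), check)


# Toy usage: the check rejects one candidate, and the "model" just upper-cases.
bench = build_benchmark("calendar scheduling", 4,
                        check=lambda ex: not ex.prompt.endswith("3"))
score = evaluate(bench,
                 model=lambda p: p.upper(),
                 metric=lambda ex, out: 1.0 if out.lower() == ex.prompt else 0.0)
```

Keeping each stage behind a small function boundary is one way to let a developer swap in a stricter check or a different metric without touching the other stages, which loosely mirrors the controllability the abstract describes.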
