BenchAgents: Multi-Agent Systems for Structured Benchmark Creation

Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran

TL;DR

BenchAgents addresses the benchmark bottleneck in evaluating evolving generative capabilities by introducing a four-agent, LLM-driven pipeline (Planning, Data Generation, Verification, Evaluation) that automates high-quality benchmark creation with developer oversight. The paper demonstrates three benchmarks (BA-Calendar, BA-Text, and BA-Causal) spanning planning, constraint satisfaction, and visual causal reasoning, and provides fine-grained analyses of state-of-the-art models. It reports that reasoning-enabled models excel on planning and constrained generation while negation and visual perception remain challenging, and it presents a scalable, controllable framework for open-ended evaluation that retains strong quality assurance. Overall, BenchAgents reduces manual effort and bias while enabling systematic diagnostics of model strengths and failure modes across language and vision tasks.

Abstract

Evaluation insights are limited by the availability of high-quality benchmarks. As models evolve, there is a need to create benchmarks that can measure progress on new and complex generative capabilities. However, manually creating new benchmarks is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BenchAgents, a multi-agent framework that methodically leverages large language models (LLMs) to automate evaluation benchmark creation while inherently ensuring data and (evaluation) metric quality. BenchAgents decomposes the benchmark creation process into planning, generation, verification, and evaluation, each of which is orchestrated via LLM agents. These agents interact with each other and utilize feedback from benchmark developers to improve and flexibly control data diversity and quality. We use BenchAgents to create benchmarks to evaluate capabilities related to planning, constraint satisfaction, and causal reasoning spanning both language and vision modalities. We then use these benchmarks to study state-of-the-art models and extract new insights into common failure modes and model differences.
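The four-stage decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: each stage is a plain Python function standing in for an LLM agent, and every name here (Instance, plan, generate, verify, evaluate, build_benchmark, and the num_participants parameter) is a hypothetical chosen for illustration. The sketch only shows how a plan feeds generation, how verification filters instances, and how evaluation returns per-check results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    """One benchmark item: a prompt plus the parameters used to create it."""
    prompt: str
    params: Dict[str, int]


def plan(task_description: str) -> Dict[str, List[int]]:
    # Stand-in for the planning stage: decide which parameters to vary.
    # A real agent would derive this from the task description; here it is fixed.
    return {"num_participants": [2, 3, 4]}


def generate(plan_: Dict[str, List[int]]) -> List[Instance]:
    # Stand-in for the generation stage: one instance per planned value.
    return [
        Instance(
            prompt=f"Schedule a meeting for {n} participants.",
            params={"num_participants": n},
        )
        for n in plan_["num_participants"]
    ]


def verify(instance: Instance) -> bool:
    # Stand-in for the verification stage: keep only well-formed instances.
    return bool(instance.prompt) and instance.params["num_participants"] >= 2


def evaluate(instance: Instance, response: str) -> Dict[str, bool]:
    # Stand-in for the evaluation stage: programmatic checks on a response.
    return {
        "non_empty": bool(response.strip()),
        "mentions_count": str(instance.params["num_participants"]) in response,
    }


def build_benchmark(task_description: str) -> List[Instance]:
    # Plan, then generate, then discard instances that fail verification.
    return [inst for inst in generate(plan(task_description)) if verify(inst)]


benchmark = build_benchmark("calendar scheduling")
checks = evaluate(benchmark[0], "Meet at 10am with 2 people.")
```

In this toy setup the developer-feedback loop mentioned in the abstract would correspond to editing the output of plan or the checks in verify before running the remaining stages.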

Paper Structure

This paper contains 25 sections, 18 figures, 15 tables.

Figures (18)

  • Figure 1: Overview of BenchAgents. P-Agent generates a plan for data generation and communicates this to G-Agent. G-Agent writes code for data generation and communicates to all agents. P-Agent generates plans for evaluation and verification and communicates these to the respective agents. E-Agent and V-Agent write code for evaluation and verification. For each instance, generation, verification and evaluation are run.
  • Figure 2: BA-Calendar: (a) fraction passed and pass all accuracy; (b) pass all accuracy vs. constrainedness; (c) rate of 'no solution' outcomes (lower is better for feasible, higher for infeasible); (d, e) pass rate for a given constraint(s) for task instances where constraints are applied.
  • Figure 3: Distribution of complexity metric (bar) and GPT-4o average pass-all (line) across all tasks.
  • Figure 4: BA-Text: (a) shows fraction passed and pass all accuracy; (c) and (d) show pass rate for instances with specific constraint(s)
  • Figure 5: BA-Causal: (a) overall accuracy; accuracy by (b) number of objects, (c) material diversity, (e) complexity, and (f) shape diversity; (d) compares text-only vs. original image task.
  • ...and 13 more figures
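The captions for Figures 2 and 4 name two metrics, "fraction passed" and "pass all" accuracy, without defining them in this excerpt. The sketch below assumes one plausible reading: each model response is scored against a set of boolean constraint checks, "fraction passed" is the share of checks an instance satisfies, and "pass all" is whether every check is satisfied, with both averaged over instances. The function names and the input format are hypothetical.

```python
from typing import Dict, List


def fraction_passed(checks: Dict[str, bool]) -> float:
    # Share of constraint checks satisfied by one response (assumed definition).
    return sum(checks.values()) / len(checks) if checks else 0.0


def pass_all(checks: Dict[str, bool]) -> bool:
    # True only if the response satisfies every check (assumed definition).
    return bool(checks) and all(checks.values())


def aggregate(results: List[Dict[str, bool]]) -> Dict[str, float]:
    # Average both metrics over all evaluated instances.
    n = len(results)
    if n == 0:
        return {"fraction_passed": 0.0, "pass_all": 0.0}
    return {
        "fraction_passed": sum(fraction_passed(r) for r in results) / n,
        "pass_all": sum(pass_all(r) for r in results) / n,
    }


results = [
    {"duration": True, "availability": True},
    {"duration": True, "availability": False},
]
summary = aggregate(results)
```

Under this reading, "pass all" is the stricter of the two: a response that meets most constraints still counts as a failure, which is why the two numbers can diverge sharply on heavily constrained instances.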