MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Guoli Yin, Haoping Bai, Shuang Ma, Feng Nan, Yanchao Sun, Zhaoyang Xu, Shen Ma, Jiarui Lu, Xiang Kong, Aonan Zhang, Dian Ang Yap, Yizhe Zhang, Karsten Ahnert, Vik Kamath, Mathias Berglund, Dominic Walsh, Tobias Gindele, Juergen Wiest, Zhengfeng Lai, Xiaoming Wang, Jiulong Shan, Meng Cao, Ruoming Pang, Zirui Wang

TL;DR

MMAU is a holistic offline benchmark that evaluates LLM agents across $5$ domains and $5$ core capabilities. Its static dataset contains $3{,}220$ prompts spanning $64$ subjects and $20$ tasks, and $18$ representative models are evaluated on it. Evaluation is decomposed into Understanding, Reasoning, Planning, Problem-solving, and Self-correction, with both domain-centric and capability-centric analyses that expose strengths and gaps that traditional task-completion benchmarks do not separate. The five domains are Tool-use, DAG QA, DS/ML coding, Contest-level coding, and Mathematics, which together enable granular diagnostics of agent behavior. By releasing the dataset and evaluation scripts, MMAU aims to improve the interpretability, reliability, and comparability of LLM agent performance, and to complement interactive benchmarks for a more complete assessment of agent capabilities.

Abstract

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to pinpoint where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains (Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics) and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/tree/main/docs/research/mmau.
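
To make the domain-centric and capability-centric views concrete, here is a minimal sketch of how per-task scores could be rolled up into both. It is not the released evaluation script: the task names, capability tags, and scores are hypothetical, and the unweighted mean is an assumption, since the text above does not state how tasks are weighted.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task records. Task names, capability tags, and scores
# are invented for illustration and are not taken from the MMAU release.
TASKS = [
    {"task": "tool_use_example", "domain": "Tool-use",
     "capabilities": ["Understanding"], "score": 0.80},
    {"task": "math_planner_shift", "domain": "Mathematics",
     "capabilities": ["Planning"], "score": 0.40},
    {"task": "math_solver_shift", "domain": "Mathematics",
     "capabilities": ["Problem-solving"], "score": 0.60},
    {"task": "code_self_debug", "domain": "Contest-level coding",
     "capabilities": ["Self-correction", "Reasoning"], "score": 0.50},
]


def aggregate(tasks):
    """Group task scores by domain and by capability, then average each group.

    Assumption: every task contributes equally (unweighted mean).
    """
    by_domain = defaultdict(list)
    by_capability = defaultdict(list)
    for t in tasks:
        by_domain[t["domain"]].append(t["score"])
        # A task may probe more than one capability; it counts toward each.
        for cap in t["capabilities"]:
            by_capability[cap].append(t["score"])
    return {
        "domain": {k: mean(v) for k, v in by_domain.items()},
        "capability": {k: mean(v) for k, v in by_capability.items()},
        "overall": mean(t["score"] for t in tasks),
    }


report = aggregate(TASKS)
```

With this layout, the same 20 task scores can feed both views without re-running any model; only the grouping key changes.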

Paper Structure (28 sections, 15 figures, 10 tables)

Figures (15)

  • Figure 1: Evaluation results across different models on MMAU. For clarity, this figure includes only a selection of representative models. The domain-centric, capability-centric, and overall evaluation results are aggregated from all 20 tasks in MMAU. For detailed per-task evaluations, please refer to Appendix \ref{sec:appendix_eval_results}.
  • Figure 2: Overview of MMAU. MMAU is designed to provide both capability-centric evaluation (top) and domain-centric evaluation (bottom). It includes over 3K distinct prompts spanning 64 subjects and 5 domains. To evaluate the fundamental capabilities of LLM agents in a disentangled manner, we carefully designed 20 tasks aimed at decomposing these capabilities and assessing performance. Note: For clarity of visualization, the data examples and prompts shown here are simplified. For the exact data examples and prompts, please refer to Appendices \ref{sec:appendix_data_examples} and \ref{sec:appendix_task_prompts}.
  • Figure 3: Different error types on a math problem.
  • Figure 4: Construction of planner-shift task and solver-shift task.
  • Figure 5: A multi-turn coding and QA example for Data Science and Machine Learning.
  • ...and 10 more figures