
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang

Abstract

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents that solve problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along two axes: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than only final answers, and we quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which drops sharply to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
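The abstract's "overthinking metric relative to human trajectories" could plausibly be computed as a ratio of a model's executed steps to the human reference trajectory's steps. The sketch below is purely illustrative and assumes that definition; the function name and signature are hypothetical, not the paper's actual formulation.

```python
# Hypothetical sketch: overthinking measured as the ratio of a model's
# tool-call steps to the steps in the human reference trajectory.
# This definition is an assumption for illustration, not Agentic-MME's.

def overthinking_ratio(model_steps: int, human_steps: int) -> float:
    """Return how many times more steps the model took than the human
    reference trajectory; values above 1.0 suggest overthinking."""
    if human_steps <= 0:
        raise ValueError("human reference trajectory must contain at least one step")
    return model_steps / human_steps
```

Under this reading, a model that takes 15 tool calls on a task a human solved in 10 steps would score 1.5, with larger values indicating less efficient trajectories.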


Paper Structure

This paper contains 45 sections, 2 equations, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: Case Studies in Agentic-MME across three difficulty levels. The examples highlight the benchmark's escalating complexity, evolving from isolated visual operations in Level 1 to deeply synergistic, multi-round visual and knowledge workflows in Level 3.
  • Figure 2: Data Collection and Annotation pipeline, including image sourcing, backward drafting, granular step-wise annotation, and quality assurance.
  • Figure 3: Overview of Agentic-MME Dataset Statistics. The benchmark exhibits broad domain and semantic diversity, with increasing tool calls and checkpoints reflecting the escalating demand for long-horizon reasoning across difficulty levels.
  • Figure 4: Fine-Grained Error Analysis. The heatmap illustrates the frequency of seven failure modes across different difficulty levels, averaged over both Code (Gen) and Atomic (Atm) execution modes. Darker colors denote higher frequencies.
  • Figure 5: A more fine-grained domain distribution.
  • ...and 8 more figures