MageBench: Bridging Large Multimodal Models to Agents

Miaosen Zhang, Qi Dai, Yifan Yang, Jianmin Bao, Dongdong Chen, Kai Qiu, Chong Luo, Xin Geng, Baining Guo

TL;DR

MageBench targets the gap in evaluating large multimodal models (LMMs) as agents by introducing a Vision-in-the-Chain (ViC) paradigm across three lightweight environments (WebUI, Sokoban, Football). It operationalizes ViC with two agent baselines (Global and Online) and a novel AES metric for WebUI, revealing that current LMMs struggle with continuous visual feedback, long-context reasoning, and planning compared to humans. The results highlight specific deficiencies in interleaved image-text processing, visual imagination, and high-level planning, providing concrete directions for model and prompt design to improve embodied, multimodal reasoning. The work offers a scalable, reproducible benchmark with open data and code to drive research toward truly autonomous multimodal agents, and it demonstrates potential generalization to robotics and structured visual generation tasks.

Abstract

Large multimodal models (LMMs) have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in the language part, where the chain-of-thought is composed entirely of text. We consider the scenario where visual signals are continuously updated and required throughout the decision-making process. Such a vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, yet it is rarely evaluated. In this paper, we introduce MageBench, a reasoning-capability-oriented multimodal agent benchmark that, while having lightweight environments, poses significant reasoning challenges and holds substantial practical value. The benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly validates the agent's knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models outperform random actions, and all of them fall far short of human-level performance. More specifically, we find that current models severely lack the ability to modify their plans based on visual feedback, as well as visual imagination, interleaved image-text long-context handling, and other abilities. We hope that our work will provide optimization directions for LMMs from the perspective of being an agent. We release our code and data at https://github.com/microsoft/MageBench.

Paper Structure

This paper contains 49 sections, 21 equations, 32 figures, 5 tables, 1 algorithm.

Figures (32)

  • Figure 1: Overview of MageBench. MageBench is a multimodal agent benchmark as well as a lightweight and fast platform for reasoning-oriented agent research. It currently contains three environments: WebUI, Sokoban, and Football. The results indicate that existing models are still far from reaching human-level performance as agents. Only a few models outperform random actions, whose results are represented by the black dashed line in the bar chart.
  • Figure 2: The difference between vision-in-the-chain reasoning and the existing reasoning paradigm. Example images are adapted from [imageofthought, sokoban, vila].
  • Figure 3: An overview of WebUI and its evaluation. LMM agents are required to regenerate the webpage according to the description. We match the generated elements with the atomic elements in the ground truth, then compare their CSS attributes to obtain a similarity score. A specific example of a task description can be found in Appendix \ref{sec:webtaskdes}. Technical details of the evaluation can be found in Appendix \ref{app:webuieval}.
  • Figure 4: A segment from the Germany vs. Mexico match in the 2018 FIFA World Cup (right), and the initial game scene inspired by it (left). The model analyzes the scene and generates one of the actions (bottom); Sokoban-Online follows a similar process.
  • Figure 5: WebUI-Online results. Different line styles represent different prompt types, while different colors denote different models. The horizontal axis shows the number of iterations the model takes to modify the webpage code based on feedback. The gray-shaded areas indicate regions of variance.
  • ...and 27 more figures
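
The WebUI evaluation outlined in the Figure 3 caption (match generated elements to ground-truth atomic elements, then compare CSS attributes) can be illustrated with a minimal sketch. This is not the paper's AES metric, whose details are in its appendix. The sketch assumes elements are matched by a shared name and that attribute similarity is the fraction of exactly matching CSS values; the function names `attribute_similarity` and `page_similarity` are hypothetical.

```python
from typing import Dict

# Hypothetical representation: element name -> {CSS attribute -> value}.
Page = Dict[str, Dict[str, str]]


def attribute_similarity(generated: Dict[str, str], truth: Dict[str, str]) -> float:
    """Fraction of ground-truth CSS attributes reproduced exactly (assumed rule)."""
    if not truth:
        return 1.0
    matched = sum(1 for key, value in truth.items() if generated.get(key) == value)
    return matched / len(truth)


def page_similarity(generated: Page, truth: Page) -> float:
    """Average attribute similarity over ground-truth elements.

    Elements are matched by name (an assumption); a ground-truth element
    with no generated counterpart contributes a score of 0.
    """
    if not truth:
        return 1.0
    total = 0.0
    for name, truth_css in truth.items():
        generated_css = generated.get(name)
        if generated_css is not None:
            total += attribute_similarity(generated_css, truth_css)
    return total / len(truth)


if __name__ == "__main__":
    truth = {
        "header": {"color": "red", "width": "100px"},
        "footer": {"color": "blue"},
    }
    generated = {"header": {"color": "red", "width": "120px"}}
    print(page_similarity(generated, truth))
```

In this toy example the generated page reproduces one of the two header attributes and omits the footer, so the per-element scores are 0.5 and 0, which average to 0.25. A real evaluator would need a more robust matching step than name equality and tolerance for near-equal values.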