V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models

Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang

TL;DR

This work tackles the inadequacy of static, text-friendly benchmarks for evaluating vision-centric reasoning in multimodal LLMs. It proposes V-MAGE, a game-based framework with five video games and 30+ levels, evaluated via a dynamic Elo ranking against human baselines in continuous-space visual environments. The approach reveals that, despite scaling, current MLLMs underperform humans in complex, interactive tasks due to perceptual and reasoning bottlenecks, including temporal tracking and anchoring biases. The findings offer concrete guidance for improving perception, temporal reasoning, and agentic strategies, and the authors provide open-source tooling to foster community progress.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs' visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic Elo-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs' ability to perform real-time, vision-grounded interactions. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings. Code is publicly available at https://github.com/CSU-JPG/V-MAGE.
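
To make the Elo-based ranking mentioned in the abstract more concrete, the following is a minimal sketch of the standard pairwise Elo update. It is not taken from the V-MAGE code: the function names, the K-factor of 32, the 400-point scale, and the rule for turning two game scores into a win, draw, or loss are all illustrative assumptions, and the paper's dynamic variant may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def outcome_from_scores(score_a: float, score_b: float) -> float:
    """Assumed rule: the higher game score on a level wins; equal scores draw."""
    if score_a > score_b:
        return 1.0
    if score_a < score_b:
        return 0.0
    return 0.5


def elo_update(rating_a: float, rating_b: float, result_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    result_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is an assumed K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (result_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical usage: two models start at 1500 and are compared on one level.
result = outcome_from_scores(score_a=12.0, score_b=7.0)
new_a, new_b = elo_update(1500.0, 1500.0, result)
```

Because each comparison is made on the same level, a rating of this kind depends on relative outcomes rather than raw scores, which is one plausible reason an Elo scheme suits games whose score scales differ.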

Paper Structure

This paper contains 39 sections, 7 equations, 29 figures, 31 tables.

Figures (29)

  • Figure 1: The overview of the V-MAGE benchmark, designed to evaluate vision-centric capabilities and higher-level reasoning of MLLMs across 5 free-form games with 30+ levels. V-MAGE assesses critical abilities in visual reasoning, providing a comprehensive evaluation of model performance in complex, dynamic environments.
  • Figure 2: V-MAGE games and evaluation pipeline. V-MAGE employs five distinct games, each with several levels, to facilitate a decomposed evaluation of model performance. These games include FlappyBird, Race, SuperMario, Pong and TempestRun. During the evaluation process, the Agent module receives visual game state information directly from the Game module, primarily in the form of screenshots. The Agent module then structures these screenshots, combined with prompts containing the game rules, into the appropriate input format for MLLMs. Subsequently, the model's output is processed by the Agent module to generate executable actions, which are then transmitted back to the Game module to update the environment state.
  • Figure 3: Race level design. Six levels progressively increase in difficulty while sharing the core objective: navigating a car to a trophy. Detailed Race level configurations are provided in Appendix Table \ref{tab:race level configs}.
  • Figure 4: MLLMs trail humans by a large margin in all five games. Levels marked with an asterisk (*) denote the 'no history' setting. Detailed performance metrics for each model across individual game levels are provided in Appendix \ref{detailed statistics} (Tables \ref{tab:race_performance}-\ref{tab:tempestrun_performance}).
  • Figure 5: Capability maps of the underlying visual capabilities of each model.
  • ...and 24 more figures
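
The evaluation pipeline described in the Figure 2 caption (Game module sends a screenshot, Agent module builds the model input, the model's output is parsed into an action, the action updates the game) can be sketched roughly as follows. Everything here is a hypothetical stand-in rather than the V-MAGE API: `ToyGame`, `Agent`, `run_episode`, the action names, and the use of a text string in place of a real screenshot are assumptions made only to keep the loop runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class ToyGame:
    """Hypothetical Game module: the whole state is one integer position."""
    position: int = 0
    goal: int = 3
    steps: int = 0

    def screenshot(self) -> str:
        # A real game would return image pixels; a string keeps the sketch runnable.
        return f"position={self.position} goal={self.goal}"

    def step(self, action: str) -> bool:
        """Apply an action and report whether the goal has been reached."""
        if action == "RIGHT":
            self.position += 1
        elif action == "LEFT":
            self.position -= 1
        self.steps += 1
        return self.position == self.goal


class Agent:
    """Hypothetical Agent module that mediates between the game and a model."""

    def __init__(self, model: Callable[[Dict[str, str]], str],
                 rules: str, valid_actions: Sequence[str]):
        self.model = model
        self.rules = rules
        self.valid_actions = tuple(valid_actions)

    def build_input(self, frame: str) -> Dict[str, str]:
        # Combine the game rules prompt with the current visual observation.
        return {"prompt": self.rules, "image": frame}

    def parse_action(self, output: str) -> str:
        # Pick the first valid action named in the model output, else do nothing.
        text = output.upper()
        for action in self.valid_actions:
            if action in text:
                return action
        return "NOOP"

    def act(self, frame: str) -> str:
        return self.parse_action(self.model(self.build_input(frame)))


def run_episode(game: ToyGame, agent: Agent, max_steps: int = 10) -> bool:
    """Run the observe, decide, act loop until success or the step budget ends."""
    for _ in range(max_steps):
        if game.step(agent.act(game.screenshot())):
            return True
    return False


# Hypothetical usage with a stub "model" that always answers in free text.
game = ToyGame()
agent = Agent(model=lambda inputs: "I will move right.",
              rules="Reach the goal position.", valid_actions=("LEFT", "RIGHT"))
solved = run_episode(game, agent)
```

Keeping input construction and output parsing inside the agent means the game never depends on a particular model's prompt format, which matches the separation of Game and Agent modules the caption describes.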