LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
Yushuo Zheng, Zicheng Zhang, Xiongkuo Min, Huiyu Duan, Guangtao Zhai
TL;DR
LM Fight Arena introduces a fully automated benchmark for evaluating large multimodal models (LMMs) in real-time adversarial settings by pitting six models against each other in Mortal Kombat II, with all agents controlling the same character. The framework combines frame-level visual data with structured game-state features and uses a language-driven controller to convert textual actions into button presses, enabling reproducible, zero-shot evaluation of sequential decision-making. Results reveal a pronounced performance hierarchy: Claude 3.5 Sonnet achieves a 100% win rate, followed by Gemini 2.5 Pro and the Qwen variants, while GPT-4o fails to win. This hierarchy highlights the gap between static multimodal understanding and dynamic, action-oriented reasoning. The work argues that fighting-game environments offer an interpretable and practical way to evaluate real-time perception-action coupling in LMMs, and it outlines future directions: broader game genres, longer match series, and human calibration to support robust, safe deployment in interactive AI systems.
Abstract
Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, with each agent controlling the same character to ensure a fair comparison. The models are prompted to interpret game frames and state data to select their next actions. Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM's strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.
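To make the idea of a language-driven controller concrete, the following is a minimal sketch of how a model's free-text reply might be mapped to button presses. It is not the authors' implementation: the action vocabulary, the button names, the fallback action, and the helper names `parse_action` and `to_buttons` are all assumptions made for illustration, since this section does not specify them.

```python
import re

# Hypothetical action vocabulary and button names (assumptions for this sketch;
# the benchmark's real action set and button layout are not given here).
ACTION_TO_BUTTONS = {
    "move left": ["LEFT"],
    "move right": ["RIGHT"],
    "jump": ["UP"],
    "crouch": ["DOWN"],
    "block": ["BLOCK"],
    "high punch": ["HIGH_PUNCH"],
    "low kick": ["LOW_KICK"],
}

# Assumed fallback when the reply names no known action.
DEFAULT_ACTION = "block"


def parse_action(reply: str) -> str:
    """Return the first known action named in a free-text model reply."""
    # Lowercase, replace punctuation with spaces, and collapse whitespace.
    text = re.sub(r"[^a-z ]+", " ", reply.lower())
    text = re.sub(r"\s+", " ", text).strip()

    best = None  # (position in text, action name)
    for name in ACTION_TO_BUTTONS:
        match = re.search(r"\b" + re.escape(name) + r"\b", text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), name)
    return best[1] if best else DEFAULT_ACTION


def to_buttons(reply: str) -> list[str]:
    """Convert a model reply into the list of buttons to press this step."""
    return list(ACTION_TO_BUTTONS[parse_action(reply)])
```

In a loop of this shape, each step would send the current frame and state data to the model, pass its reply through `to_buttons`, and forward the result to the emulator. Falling back to a fixed action on unparseable output is one possible design choice; it keeps the match running when a model replies off-format, at the cost of hiding formatting failures unless they are logged separately.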
