LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition

Yushuo Zheng, Zicheng Zhang, Xiongkuo Min, Huiyu Duan, Guangtao Zhai

TL;DR

LM Fight Arena introduces a fully automated benchmark for evaluating large multimodal models in real-time adversarial settings by pitting six models against each other in Mortal Kombat II, with all agents controlling the same character. The framework combines frame-level visual data with structured game-state features and uses a language-driven controller to convert textual actions into button presses, enabling reproducible, zero-shot evaluation of sequential decision-making. Results reveal a pronounced performance hierarchy, with Claude 3.5 Sonnet achieving a 100% win rate, followed by Gemini 2.5 Pro and Qwen variants, while GPT-4o fails to win, highlighting gaps between static multimodal understanding and dynamic, action-oriented reasoning. The work argues that fighting-game environments offer a valuable, interpretable, and practical direction for evaluating real-time perception-action coupling in LMMs, and it outlines future directions (broader game genres, longer match series, and human calibration) aimed at robust, safe deployment in interactive AI systems.

Abstract

Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, with each agent controlling the same character to ensure a fair comparison. The models are prompted to interpret game frames and state data to select their next actions. Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM's strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.

Paper Structure

This paper contains 14 sections and 4 figures.

Figures (4)

  • Figure 1: The LM Fight Arena. Six state-of-the-art large multimodal models (LMMs) compete in a round-robin tournament in the classic fighting game Mortal Kombat II. Each model controls the same character, Liu Kang, to ensure a fair comparison. The models receive real-time visual frames and structured game state information, then output their next actions as natural language commands.
  • Figure 2: Overview of the LM Fight Arena control loop. The left column illustrates visual processing: the emulator streams raw frames, we subsample every fourth frame, and we annotate player positions before packaging the sequence. The central block shows the real-time interaction between the emulator and the language-driven controller that translates text actions into Sega Genesis button presses. The right column depicts the structured state features (health bars, absolute coordinates, facing direction, and the trailing action history) that accompany the visual stack.
  • Figure 3: (a) Tournament matchup matrix showing the winner's remaining health percentage for each model pair. Green cells indicate a win for the row model, red cells indicate a loss. (b) Bar chart summarizing overall win rates for each model.
  • Figure 4: Heatmap of button press frequencies across all models and matches. Brighter colors indicate more frequent usage of that button.
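
Two steps named in the Figure 2 caption, subsampling every fourth frame and translating a text action into Sega Genesis button presses, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the action vocabulary, and the action-to-button bindings below are hypothetical and chosen only for illustration, and starting the subsample at the first frame is also an assumption.

```python
from typing import Dict, List, Sequence

# Hypothetical action vocabulary and bindings. The button names are real
# Sega Genesis inputs, but which action maps to which button is assumed here.
ACTION_TO_BUTTONS: Dict[str, List[str]] = {
    "move left": ["LEFT"],
    "move right": ["RIGHT"],
    "jump": ["UP"],
    "crouch": ["DOWN"],
    "high punch": ["A"],
    "low kick": ["C"],
    "jump right": ["UP", "RIGHT"],
}


def subsample_frames(frames: Sequence, stride: int = 4) -> list:
    """Keep every `stride`-th frame, starting with the first (assumed offset)."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(frames[::stride])


def parse_action(text: str) -> List[str]:
    """Map a model's text reply to button names.

    The reply is lower-cased and whitespace-normalized before lookup.
    Replies outside the vocabulary become a no-op (empty button list).
    """
    key = " ".join(text.lower().split())
    return list(ACTION_TO_BUTTONS.get(key, []))


if __name__ == "__main__":
    raw_frames = list(range(10))          # stand-ins for emulator frames
    print(subsample_frames(raw_frames))   # frames kept for the visual stack
    print(parse_action("  High  Punch ")) # buttons to press this step
    print(parse_action("dance"))          # unknown action, no buttons
```

Treating an unrecognized reply as a no-op is one possible design choice; a real controller could instead re-prompt the model or repeat the previous action.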