Play to Generalize: Learning to Reason Through Game Play

Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei

TL;DR

This work introduces Visual Game Learning (ViGaL), a post-training paradigm that finetunes a 7B multimodal LLM via reinforcement learning on simple visual arcade games to elicit transferable reasoning. Despite using no in-domain math data during RL, ViGaL shows strong zero-shot generalization to multimodal math and cross-domain benchmarks, often surpassing specialist models trained on target tasks, while preserving broad visual capabilities. Ablations reveal that game design, reward structure, and multimodal inputs jointly shape the downstream benefits, and combining multiple games yields additive gains across math subfields. The results point to a scalable strategy for unlocking generalizable reasoning in multimodal models through surrogate tasks, offering practical implications for data efficiency and model robustness.

Abstract

Developing reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by literature suggesting that gameplay promotes transferable reasoning skills, we propose a novel post-training method, Visual Game Learning (ViGaL), where MLLMs develop generalizable reasoning skills through playing arcade-like games. Specifically, we show that training a 7B-parameter MLLM via reinforcement learning (RL) on simple games like Snake significantly enhances the downstream performance on multimodal math benchmarks like MathVista, on multi-discipline questions like MMMU and on 3D spatial reasoning benchmarks like VSI-Bench, without seeing any worked solutions, equations, or diagrams during RL. Remarkably, our model outperforms specialist models post-trained on benchmark-oriented multimodal reasoning data, while preserving the model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest that multimodal reasoning can emerge from gameplay, pointing to a promising strategy of designing surrogate tasks for RL post-training.

Paper Structure

This paper contains 44 sections, 2 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Overview of ViGaL. Left: We propose a novel post-training method where MLLMs are finetuned via RL to play arcade-style games such as Snake [snake_bench_2025]. We demonstrate that gameplay post-training enables MLLMs to achieve out-of-domain generalization, enhancing their performance on downstream multimodal reasoning tasks requiring math, spatial, and multi-discipline reasoning, without using math or multi-discipline data during RL. Right: Our ViGaL (RL on games) achieves a higher average accuracy increase than MM-Eureka [meng2025mm] (RL on math) across three multimodal math benchmarks. This is notable because MM-Eureka trains on large-scale, curated math datasets, while ViGaL uses only game data. Details are in Tab. \ref{tab:math_generalization}.
  • Figure 2: Post-training MLLMs to reason through RL with games. We propose post-training MLLMs via RL by playing visual games. We demonstrate this with two games: the classic arcade game Snake [snake_bench_2025], and Rotation, a self-designed task to investigate spatial reasoning. In each game, the model receives multimodal inputs and follows reasoning instructions, e.g., path planning in Snake or angle estimation in Rotation. It reasons about which action to choose, outputs its chain of thought and decision, e.g., the best/worst move or the predicted angle, and receives a reward. Through gameplay, the model acquires reasoning abilities that transfer to downstream multimodal reasoning tasks such as math and multi-discipline question answering.
  • Figure 3: Per-category gains on MathVerse are not uniform. The eight math categories follow MathVerse [zhang2024mathversedoesmultimodalllm]. (a) Snake yields the largest gains on Coordinates and Expressions, consistent with its 2D grid structure. (b) Rotation boosts accuracy on Angle and Length questions but reduces it on Expressions, suggesting its training primarily incentivizes orientation recognition.
  • Figure 4: Reasoning traces from different games and math questions. Top: Algebraic functions and coordinate-level interpretations that emerge from playing the Snake game help solve Expression questions. Bottom: Spatial reasoning skills incentivized by playing the Rotation game appear when solving Angle-related problems.
  • Figure 5: Goals and example model responses for the Atari games used for evaluation. We implement seven Atari games from Atari-GPT [waytowich2024atari].
  • ...and 4 more figures
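
The Figure 2 caption describes a loop in which the model sees a game state, outputs a best/worst move, and receives a reward. The sketch below illustrates one way such a reward could be computed for a Snake-like grid. It is not the paper's implementation: the `SnakeState` class, the `MOVES` table, the Manhattan-distance scoring heuristic, and the 0.5/0.5 reward split are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical move set for a Snake-like grid game (x grows right, y grows down).
MOVES = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}


@dataclass(frozen=True)
class SnakeState:
    """Illustrative game state; not taken from the paper."""
    size: int      # board is size x size
    body: tuple    # snake cells as (x, y), head first
    apple: tuple   # apple position (x, y)


def score_move(state: SnakeState, move: str) -> float:
    """Assumed heuristic: a crash scores -inf, otherwise the negative
    Manhattan distance from the new head position to the apple."""
    dx, dy = MOVES[move]
    hx, hy = state.body[0]
    nx, ny = hx + dx, hy + dy
    out_of_bounds = not (0 <= nx < state.size and 0 <= ny < state.size)
    # Simplification: every current body cell, including the tail, counts as blocked.
    if out_of_bounds or (nx, ny) in state.body:
        return float("-inf")
    return -(abs(nx - state.apple[0]) + abs(ny - state.apple[1]))


def reward(state: SnakeState, predicted_best: str, predicted_worst: str) -> float:
    """Give 0.5 for a correct best move and 0.5 for a correct worst move.
    Ties count as correct; unknown move names earn nothing."""
    scores = {m: score_move(state, m) for m in MOVES}
    best, worst = max(scores.values()), min(scores.values())
    r = 0.0
    if scores.get(predicted_best) == best:
        r += 0.5
    if scores.get(predicted_worst) == worst:
        r += 0.5
    return r


# Example: head at (2, 2), body extends down, apple two cells to the right.
example = SnakeState(size=5, body=((2, 2), (2, 3)), apple=(4, 2))
example_reward = reward(example, predicted_best="RIGHT", predicted_worst="DOWN")
```

In an RL setup of the kind the caption describes, a scalar like `example_reward` would be computed from the decision parsed out of the model's response and fed to the policy update; the parsing step and the update rule are outside this sketch.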