Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning

Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Zhiheng Xi, Changhao Jiang, Zhangyue Yin, Yining Zheng, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

TL;DR

This paper introduces Game-RL, a framework that uses synthesizable, verifiable game data to improve Vision-Language Models' general reasoning through reinforcement learning. A novel Code2Logic pipeline converts game code into a large, multimodal reasoning dataset called GameQA, which spans 30 games, 158 tasks, and about 140K questions. RL trained solely on GameQA with GRPO yields improvements across seven vision-language benchmarks and demonstrates noteworthy out-of-domain generalization, suggesting video game environments are a practical resource for broad reasoning capability. The work also reports on data quality, scaling and diversity effects, and confirms the remaining gap between current models and human performance on the GameQA benchmark.

Abstract

Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g., geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting the exploration and learning of Vision Language Models (VLMs) through RL. We find video games inherently provide rich visual elements and mechanics that are easy to verify. To fully use the multimodal and verifiable reward in video games, we propose Game-RL, constructing diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize game reasoning task data, thus obtaining the GameQA dataset of 30 games and 158 tasks with controllable difficulty gradation. Unexpectedly, RL training solely on GameQA enables multiple VLMs to achieve performance improvements across 7 diverse vision-language benchmarks, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, this suggests that video games may serve as valuable scenarios and resources to boost general reasoning abilities. Our code, dataset and models are available at the GitHub repository.
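
To make the Code2Logic idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy grid game; `move`, the two templates, `generate_sample`, and `verify` are illustrative names. The sketch is text-only, whereas GameQA samples also include a rendered game state. It shows a data engine reusing core game code to fill a task template and compute a ground-truth answer, which then supports an exact-match verifiable reward. The `num_moves` parameter stands in for a difficulty control.

```python
import random
from dataclasses import dataclass

# Hypothetical toy game (an assumption for illustration): a player on a grid with walls.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def move(grid, pos, direction):
    """Core game logic: return the new position, or the old one if the move is blocked."""
    dr, dc = MOVES[direction]
    r, c = pos[0] + dr, pos[1] + dc
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#":
        return (r, c)
    return pos


# Task template: one question template and one analysis (reasoning step) template.
QUESTION_TEMPLATE = "The player starts at {start}. After the moves {moves}, where is the player?"
ANALYSIS_TEMPLATE = "Step {i}: move {direction} -> {pos}"


@dataclass
class Sample:
    question: str
    analysis: str
    answer: str


def generate_sample(grid, start, num_moves, rng):
    """Data engine: replay random moves through the game code and fill the templates."""
    moves = [rng.choice(sorted(MOVES)) for _ in range(num_moves)]
    pos, steps = start, []
    for i, direction in enumerate(moves, 1):
        pos = move(grid, pos, direction)  # reuse the game code, do not re-derive the rules
        steps.append(ANALYSIS_TEMPLATE.format(i=i, direction=direction, pos=pos))
    return Sample(
        question=QUESTION_TEMPLATE.format(start=start, moves=", ".join(moves)),
        analysis="\n".join(steps),
        answer=str(pos),
    )


def verify(sample, model_answer):
    """Verifiable reward (assumed exact match): 1.0 if the answer equals the engine's."""
    return 1.0 if model_answer.strip() == sample.answer else 0.0


grid = ["....", ".#..", "...."]
sample = generate_sample(grid, start=(0, 0), num_moves=3, rng=random.Random(0))
print(sample.question)
print(sample.analysis)
print("reward for the correct answer:", verify(sample, sample.answer))
```

Because the answer comes from executing the game code rather than from a model, every generated sample can be checked automatically, which is the property that makes such data usable as an RL reward signal.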

Paper Structure

This paper contains 91 sections, 1 equation, 24 figures, 17 tables.

Figures (24)

  • Figure 1: Overview of the Code2Logic approach. The process involves three main steps: (1) using LLMs to construct game code; (2) designing, with LLM assistance, the task templates (question and analysis templates) based on the generated game code, where each task template condenses one type of reasoning pattern in the game; and (3) using LLMs to construct a data engine that directly reuses the core game code from the first step, including functions like move. After these main steps, the data engine is executed to fill in the task templates developed in Step 2 and generate data samples, as illustrated in the Final Result section.
  • Figure 2: Four game examples from GameQA: 3D Reconstruction, Tangram, Sudoku, and Sokoban, each representing a distinct cognitive category. Each game displays two VQA examples consisting of: (a) a visualization of the current game state, (b) a targeted question, and (c) step-by-step reasoning with the answer. GameQA transforms complex game-playing tasks into this structured VQA format. See the appendix for more VQA examples of some representative games.
  • Figure 3: Overview of the GameQA dataset. The 30 games in GameQA can be classified into four categories based on the core abilities required to solve game tasks. The appendix provides definitions of the four game categories. Games chosen as Out-of-Domain are not used for training; instead, they are used to test generalization after the model has been trained on In-Domain games.
  • Figure 4: The scaling effect of training data quantity on general vision benchmarks for Qwen2.5-VL-7B-Instruct. The model was trained on a total of 20k samples (20 games) and evaluated every 1,000 samples. To clearly demonstrate the upward trend, the results are divided into three stages and presented using bin averaging (as described in the section on the data scale experiment).
  • Figure 5: The scaling effect of GameQA. As VLMs are trained on an increasing number of distinct games, their performance on general visual benchmarks improves. The game selection is shown in the paper's game selection table.
  • ...and 19 more figures
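
The bin averaging mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration under assumptions: the binning procedure the paper actually uses is not reproduced in this section, so the sketch assumes contiguous bins of near-equal size. The function name `bin_average` and the checkpoint scores are hypothetical placeholders, not reported results.

```python
def bin_average(scores, num_bins):
    """Split scores into contiguous, near-equal bins and return the mean of each bin."""
    if not 1 <= num_bins <= len(scores):
        raise ValueError("num_bins must be between 1 and len(scores)")
    base, extra = divmod(len(scores), num_bins)
    means, start = [], 0
    for b in range(num_bins):
        size = base + (1 if b < extra else 0)  # earlier bins absorb the remainder
        chunk = scores[start:start + size]
        means.append(sum(chunk) / size)
        start += size
    return means


# Hypothetical placeholder scores: one evaluation per 1,000 training samples, 20 in total.
checkpoint_scores = [50.0 + 0.1 * i for i in range(20)]
stage_means = bin_average(checkpoint_scores, num_bins=3)
print(stage_means)
```

Averaging within stages smooths checkpoint-to-checkpoint noise, so a trend across the three stages is easier to read than it would be from 20 individual evaluation points.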