Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao

TL;DR

Vision-Zero presents a zero-human-in-the-loop framework for scalable VLM self-improvement via strategic visual self-play framed as a two-stage Who Is the Spy game. By using label-free, domain-agnostic image pairs across CLEVR, charts, and real-world edits and an Iterative-SPO training loop that alternates self-play with RL-based supervision, it achieves sustained performance gains and broader generalization than annotation-heavy baselines. The method mitigates stagnation and negative transfer while reducing data costs, enabling efficient, generalizable reasoning, chart/OCR, and vision-centric understanding. The results suggest a practical, cost-effective path toward scalable VLM development with broad applicability.

Abstract

Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.

Paper Structure

This paper contains 22 sections, 9 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Vision-Zero Paradigm. (a) Supervised learning depends on human-curated reasoning trajectories; (b) Reinforcement Learning, although enabling models to autonomously learn reasoning processes via validated rewards, still relies heavily on expert-designed question-answer pairs. (c) In contrast, Vision-Zero is a novel self-improvement paradigm entirely independent of human experience. It constructs self-play games by leveraging image pairs that exhibit visual differences. Through the interactive and strategic game, Vision-Zero continuously generates training data for VLMs, enabling the model to achieve scalable self-improvement.
  • Figure 2: Performance Comparison of Vision-Zero with SOTA post-training methods. All models were post-trained on Qwen2.5-VL-7B. The numbers on the horizontal axis represent the accuracy of Qwen2.5-VL-7B on different tasks, while the vertical axis represents the change in accuracy of the trained model. Vision-Zero outperforms baselines trained on expensive human-labeled datasets.
  • Figure 3: Overall Framework of Vision-Zero. Vision-Zero comprises three core components. Strategic Game Environment: Each role is required to exhibit strategic behavior tailored to diverse scenarios, thereby simultaneously necessitating multiple capabilities. Label-free and Domain-agnostic Data Input: Vision-Zero accepts arbitrary inputs to promote diversity and generalization. To verify this, we train Qwen2.5-VL-7B for 100 iterations on Gobang and on our environment and evaluate on MathVision; results show that Vision-Zero generalizes effectively. Iterative-SPO: We introduce a novel two-stage training algorithm. In the clue stage, models are trained via Self-Play using a zero-sum reward inversely proportional to votes received. In the decision stage, models undergo RLVR training with group normalization, using rewards based on vote correctness.
  • Figure 4: Visualization of the datasets used in Vision-Zero. We employ three representative types of data in our experiments: (left) CLEVR-based data, (middle) Chart-based data, and (right) Real-world data. For visualization, the differing parts are circled; the circles are not present in the spy images used in the game.
  • Figure 5: Visualization of spy reasoning in Vision-Zero. A comparison of model responses to identical scenarios before and after training, as evaluated by GPT-based scoring, reveals substantial improvements in planning, retrieval, decomposition, strategy formulation, and logical reasoning.
  • ...and 2 more figures
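
The two reward signals named in the Figure 3 caption can be illustrated with a small Python sketch. This is not the authors' implementation: the caption only says that the clue-stage reward is zero-sum and decreases with votes received, and that the decision stage uses group normalization over rewards based on vote correctness. The linear form `mean - votes`, the 0/1 correctness reward, and all function names below are assumptions made for illustration.

```python
from statistics import mean, pstdev


def clue_stage_rewards(votes_received):
    """Zero-sum clue-stage rewards that fall as a player receives more votes.

    Assumed form: reward_i = mean(votes) - votes_i, so the rewards sum to zero.
    """
    avg = float(mean(votes_received))
    return [avg - v for v in votes_received]


def decision_stage_rewards(votes_cast, spy_index):
    """Assumed 0/1 reward: 1.0 if a player voted for the true spy, else 0.0."""
    return [1.0 if v == spy_index else 0.0 for v in votes_cast]


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one group (subtract mean, divide by std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical four-player round in which player 2 is the spy.
votes_received = [0, 1, 3, 0]   # votes each player received
votes_cast = [2, 2, 1, 2]       # index of the player each voter accused

clue_rewards = clue_stage_rewards(votes_received)
decision_advantages = group_normalize(decision_stage_rewards(votes_cast, spy_index=2))
```

In this toy round the most-voted player gets the lowest clue-stage reward, and the one voter who accused the wrong player ends up with a negative normalized advantage while the three correct voters share a positive one.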