Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan

TL;DR

This work tackles the data-scarcity bottleneck in reinforcement-learning-driven vision-language reasoning, especially in specialized domains. It introduces DoGe, a dual-decoupling framework that separates context-focused thinking (Thinker) from task solving (Solver) and trains them with a two-stage RL procedure based on GRPO. An iterative curriculum data-synthesis pipeline, built on a Multimodal Knowledge Pool and a Seed Problem Pool, expands training diversity and supports self-evolution in domain-specific settings. Across seven diverse benchmarks, DoGe improves performance, enhances exploration, and stabilizes training, offering a scalable path toward self-evolving LVLMs with reduced reliance on high-quality labeled data.

Abstract

Recent vision-language models (VLMs) achieve remarkable reasoning through reinforcement learning (RL), which provides a feasible route to continuously self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially hard to obtain in specialized domains such as chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to first learn from context rather than from problem solving, by refocusing on the problem-context scenarios overlooked by synthetic data methods. First, by decoupling the learning process into two components (Thinker and Solver), we quantify the reward signals of this process and propose a two-stage RL post-training approach that moves from freely exploring context to practically solving tasks. Second, to increase the diversity of training data, DoGe constructs an evolving curriculum learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving seed problem pool. Experiments show that our method consistently outperforms the baseline across various benchmarks, providing a scalable pathway for realizing self-evolving LVLMs.

Paper Structure

This paper contains 27 sections, 36 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: DoGe decouples the self-evolving VLM cognitive process into a "learning-application" cycle, designing two components: a learnable Thinker and a frozen Solver. In the first stage, the pass rate of the Solver is used as the quantitative reward for the Analysis (the output of the Thinker). In the second stage, standard GRPO is implemented for annealing, forming a complete iterative closed loop.
  • Figure 2: DoGe masks the question part and retains only the context. As shown in the example, DoGe preserves only the molecular image input. The Thinker attempts to conduct in-depth thinking without specific question input, embeds its output into the Solver, and uses the pass rate of the Solver in solving the original question as the quantitative reward criterion for the Thinker.
  • Figure 3: DoGe's data synthesis framework is analogous to the human learning process: learning knowledge from the world and then applying it to solve problems. DoGe first collects a large amount of unlabeled data from the web and databases via tools. The data is aggregated into a Multimodal Knowledge Pool, which the LVLM transforms into learnable vision-question-answer pairs. The training data for DoGe consists of these designed questions and variant problems synthesized from the iteratively updated Seed Problem Pool.
  • Figure 4: Average Entropy (DoGe vs. Baseline) during training. "Anneal" refers to DoGe's RL stage 2. Compared to the baseline, DoGe exhibits a higher initial policy entropy during training and consistently maintains greater exploration.
  • Figure 5: Distribution comparison of training data in the mathematical domain. We compare a subset of the Vision-R1 training data with a subset of ours. The visualization is obtained by embedding the text part of each problem with text-embedding-3-large and then applying t-SNE dimensionality reduction.
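
The stage-one reward described in Figures 1 and 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `pass_rate_reward`, `group_relative_advantages`, and `make_toy_solver` are hypothetical, exact string matching stands in for whatever answer checker the paper uses, the Solver is replaced by a deterministic toy function, and the advantage uses the population standard deviation of the group, which is one common reading of GRPO's group normalization.

```python
import statistics
from typing import Callable, List, Sequence

# A Solver maps (analysis, question) to an answer string. In the paper this
# is a frozen LVLM; here it is any callable (assumption for illustration).
Solver = Callable[[str, str], str]


def pass_rate_reward(analysis: str, question: str, reference: str,
                     solver: Solver, num_attempts: int = 8) -> float:
    """Reward for one Thinker analysis: the fraction of Solver attempts
    whose answer matches the reference (exact match is an assumption)."""
    correct = sum(
        solver(analysis, question).strip() == reference.strip()
        for _ in range(num_attempts)
    )
    return correct / num_attempts


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def make_toy_solver() -> Solver:
    """Hypothetical stand-in for the frozen Solver: always correct when the
    analysis mentions an aromatic ring, otherwise correct on even calls."""
    calls = {"n": 0}

    def solver(analysis: str, question: str) -> str:
        calls["n"] += 1
        if "aromatic ring" in analysis or calls["n"] % 2 == 0:
            return "benzene"
        return "unknown"

    return solver


if __name__ == "__main__":
    solver = make_toy_solver()
    # Two analyses the Thinker might produce from the image context alone.
    analyses = ["The image contains an aromatic ring.", "The image shows a molecule."]
    rewards = [
        pass_rate_reward(a, "Name this molecule.", "benzene", solver, num_attempts=4)
        for a in analyses
    ]
    print(rewards)
    print(group_relative_advantages(rewards))
```

The Thinker never sees the question in this sketch; only the Solver does, so an analysis earns a high reward only if it helps on the question that was masked out.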