WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

Jie Yang, Feipeng Ma, Zitian Wang, Dacheng Yin, Kang Rong, Fengyun Rao, Ruimao Zhang

TL;DR

This work tackles general-purpose vision-language reasoning by combining an automated, scalable QA synthesis pipeline with WeThink, a diverse dataset of over 120K multimodal QA pairs with explicit reasoning paths. It applies a hybrid reinforcement learning framework (GRPO with a mix of rule-based and model-based rewards) to train open-source vision-language models and evaluates them on 14 benchmarks, showing that greater data diversity and reasoning-centric questions improve cross-domain performance. Key contributions include the WeThink data generation pipeline, the 120K-pair dataset prepared for RL training, and empirical evidence that both supervised CoT fine-tuning and RL can enhance general multimodal reasoning; because the pipeline is automated, the data can keep growing to drive further gains. The work highlights both the potential and the constraints of automated data generation and RL-based training for general-purpose vision-language reasoning.
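As a rough illustration of the GRPO step mentioned above, the sketch below computes group-relative advantages: each response sampled for the same prompt is scored, and its reward is normalized against the mean and standard deviation of its group. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for this example; the paper's own training code may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of responses sampled for one prompt.

    Hypothetical helper: `eps` and the population standard deviation
    are illustrative choices, not details taken from the paper.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses above the group mean get a positive advantage,
    # responses below it get a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.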

Abstract

Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLMs), focusing on domain-specific tasks like math and visual perception, a critical question remains: How can we achieve general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. (3) A comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance.
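The hybrid reward described in point (3) can be pictured with a minimal sketch: answers that a rule can check are compared against the reference directly, and everything else is scored by a judge model. The routing flag `rule_verifiable`, the normalization rule, and the `judge` callable are all assumptions introduced for illustration; the abstract only states that rule-based verification and model-based assessment are combined.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period (assumed rule)."""
    return re.sub(r"\s+", " ", text.strip().lower()).rstrip(".")


def rule_based_reward(prediction: str, reference: str) -> float:
    """Return 1.0 on an exact match after normalization, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def hybrid_reward(
    prediction: str,
    reference: str,
    rule_verifiable: bool,
    judge: Callable[[str, str], float],
) -> float:
    """Route to a rule check or a judge model (hypothetical routing)."""
    if rule_verifiable:
        return rule_based_reward(prediction, reference)
    # `judge` stands in for a model-based scorer; clamp its output to [0, 1].
    return max(0.0, min(1.0, judge(prediction, reference)))


def toy_judge(prediction: str, reference: str) -> float:
    """Stand-in for a judge model; a real one would call an LLM."""
    return 0.7


choice_reward = hybrid_reward(" B. ", "b", True, toy_judge)
free_form_reward = hybrid_reward("A dog chasing a ball", "A dog playing fetch", False, toy_judge)
```

Keeping the two reward paths behind one function lets the same training loop handle multiple-choice, numeric, and free-form questions.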

Paper Structure

This paper contains 24 sections, 8 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: WeThink-VL-7B, fine-tuned from Qwen2.5-VL-7B (Bai et al., 2025) through reinforcement learning, shows significant improvements in tasks from mathematical reasoning to general challenges.
  • Figure 1: The distribution analysis of image types from WeThink.
  • Figure 2: The automatic process of question formulation for a given image. As illustrated by the orange line, starting from the coarse description provided by Qwen2.5-VL-72B, DeepSeek-R1 requests additional visual details (orange text) through multi-turn conversations with Qwen2.5-VL-72B, which facilitates the generation of context-aware, reasoning-centric questions. The process can also be conditioned on various constraints through prompts, such as prior questions (if available), task definition, and visual focus, to control the type and focus of the questions.
  • Figure 3: The automatic process of answer construction and quality control. First, DeepSeek-R1 filters out forced and open-ended questions so that the remaining questions are verifiable. Then, using the refined caption and the valid question, DeepSeek-R1 generates a chain of thought and an answer. At the same time, Qwen2.5-VL-72B generates an answer based on the image. If the two answers match, the result is kept; if not, Gemini re-evaluates DeepSeek-R1's answer, and the sample is kept with its chain of thought only when that answer is judged correct.
  • Figure 4: The distribution analysis of question domains and types from WeThink.
  • ...and 3 more figures
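
The quality-control flow described in the Figure 3 caption can be sketched as a small function. The four callables below stand in for the models named in the caption (a verifiability filter and a caption-based reasoner for DeepSeek-R1, an image-based answerer for Qwen2.5-VL-72B, and an arbiter for Gemini); their names, signatures, and the string-matching rule are hypothetical and chosen only to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class QASample:
    question: str
    chain_of_thought: str
    answer: str


def answers_match(a: str, b: str) -> bool:
    # Assumed matching rule: case- and whitespace-insensitive equality.
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


def build_sample(
    image: Any,
    caption: str,
    question: str,
    is_verifiable: Callable[[str], bool],
    reason_from_caption: Callable[[str, str], Tuple[str, str]],
    answer_from_image: Callable[[Any, str], str],
    arbiter_accepts: Callable[[Any, str, str], bool],
) -> Optional[QASample]:
    """Return a kept sample, or None if the question or answer is discarded."""
    # Step 1: drop forced or open-ended questions.
    if not is_verifiable(question):
        return None
    # Step 2: the text reasoner answers from the refined caption.
    cot, text_answer = reason_from_caption(caption, question)
    # Step 3: the vision-language model answers from the image itself.
    visual_answer = answer_from_image(image, question)
    # Step 4: keep on agreement; otherwise defer to the arbiter.
    if answers_match(text_answer, visual_answer):
        return QASample(question, cot, text_answer)
    if arbiter_accepts(image, question, text_answer):
        return QASample(question, cot, text_answer)
    return None


# Usage with stand-in callables in place of real model calls.
sample = build_sample(
    image=None,
    caption="Three red apples on a table.",
    question="How many apples are on the table?",
    is_verifiable=lambda q: True,
    reason_from_caption=lambda c, q: ("The caption lists three apples.", "3"),
    answer_from_image=lambda img, q: " 3 ",
    arbiter_accepts=lambda img, q, a: False,
)
```

In this sketch the arbiter is consulted only on disagreement, so the third model is called only for contested samples.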