Skywork-R1V3 Technical Report

Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou

TL;DR

Skywork-R1V3 introduces an open-source vision-language model that closes the reasoning gap with text-only LLMs by leveraging a reinforcement-learning post-training framework and a connector module for cross-modal alignment. The approach combines cold-start supervised priming, PPO/GRPO-style RL, and connector-focused fine-tuning to transfer and generalize reasoning across domains, achieving state-of-the-art performance on MMMU among open-source VLMs (e.g., 76.0% accuracy) and strong results on math, logic, and physics benchmarks. A novel metric, critical-token entropy, guides checkpoint selection, and ablations highlight the connector’s central role in stable cross-modal reasoning; connector-only tuning after RL significantly improves cross-domain performance without disrupting reasoning. The results demonstrate that RL-based post-training can unlock open-source VLMs with robust multimodal reasoning and broad generalization, signaling a path toward scalable, domain-agnostic multimodal intelligence and informing future directions in tool use, unified VLMs, and embodied reasoning.

Abstract

We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 stems primarily from our post-training RL framework, which activates and enhances the model's reasoning ability without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, improving accuracy from 64.3% to 76.0%, a level that matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B-parameter model to rival top closed-source VLMs. The approach also successfully transfers mathematical reasoning to reasoning tasks in other subjects. We also include an analysis of curriculum learning and reinforcement fine-tuning strategies, along with a broader discussion of multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
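
As a rough illustration of the entropy indicator mentioned above, the sketch below computes the Shannon entropy of the next-token distribution at a set of flagged positions and averages it. This excerpt does not say how the report identifies critical reasoning tokens or how the score is used to rank checkpoints, so `critical_positions`, `mean_critical_entropy`, and the toy logits are hypothetical names and data chosen only for illustration.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_critical_entropy(step_logits, critical_positions):
    """Average entropy over positions flagged as critical.

    Assumption: how positions get flagged is not specified in this excerpt;
    here they are simply passed in by the caller.
    """
    values = [token_entropy(step_logits[i]) for i in critical_positions]
    return sum(values) / len(values) if values else 0.0


# Toy data: three generation steps over a four-token vocabulary.
step_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform distribution, entropy = ln(4)
    [10.0, 0.0, 0.0, 0.0],  # sharply peaked distribution, entropy near 0
    [0.0, 0.0, 0.0, 0.0],   # uniform again
]
critical_positions = [0, 2]  # hypothetical flagged positions
score = mean_critical_entropy(step_logits, critical_positions)
```

A score like this could be logged per checkpoint and compared against validation accuracy. Whether higher or lower values indicate a better checkpoint is left open here, since the excerpt does not state it.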

Paper Structure

This paper contains 51 sections, 13 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 2: Data distribution across the three training stages.
  • Figure 3: The Performance of Skywork-R1V3-38B on PhyX-MC-Text-Minimal
  • Figure 4: Model Rankings on 2025 GAOKAO Math
  • Figure 5: The entropy of critical token vs. MMMU accuracy
  • Figure 6: Ablation Studies of Module Activation Impact on MathVista Performance
  • ...and 9 more figures