Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Yifan Zhang, Liang Hu, Haofeng Sun, Peiyu Wang, Yichen Wei, Shukang Yin, Jiangbo Pei, Wei Shen, Peng Xia, Yi Peng, Tianyidan Xie, Eric Li, Yang Liu, Xuchen Song, Yahui Zhou

TL;DR

The paper presents Skywork-R1V4, a 30B multimodal agentic model that unifies planning, thinking with images, DeepSearch, and interleaved reasoning without reinforcement learning. It leverages carefully curated supervised trajectories to ground tool use and execution, achieving state-of-the-art results on MMSearch and FVQA and exhibiting long-horizon planning with multiple tool calls. The approach demonstrates strong perception and robust, interpretable multimodal reasoning by interleaving image manipulation and web retrieval in a single trajectory. This work highlights a scalable, reproducible path to agentic multimodal intelligence driven by high-quality supervision rather than reinforcement learning.

Abstract

Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.

Paper Structure

This paper contains 16 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Skywork-R1V4 30B (A3B) demonstrates exceptional proficiency in code-based image manipulation, text and image search, and web browsing, achieving performance on high-resolution perception benchmarks that rivals or surpasses larger-scale and specialized models, while also showing advantages in multimodal DeepSearch tasks.
  • Figure 2: (a) Our data processing pipeline. For selected QA pairs, the model first decides whether image operations are needed to enhance perception or whether a direct reply is possible. If a direct reply is not possible, it generates a reasoning process and corresponding code, which is executed in a sandbox environment. The consistency between the reasoning and the sandbox output is then validated. If they are consistent, the result is fed back for the next iteration, and the loop repeats until the question can be answered. (b) Distribution of data functionalities, including common operations such as cropping, contrast enhancement, zooming, annotation, and pixel-level analysis.
  • Figure 3: Comparison of model efficiency. The first row presents the results from single-round inference without tool usage. The reported time, average tokens, and tokens per second (TPS) are averaged across samples within each benchmark. The second row shows the results from multi-round inference with code and search tools enabled.
  • Figure 4: Plan Mode.
  • Figure 5: Skywork-R1V4 enables dynamic visual exploration by iteratively cropping and querying different regions of an image to locate target objects. Starting from a panoramic view of Paris, the model strategically zooms into high-activity zones (e.g., parks and sidewalks), progressively refining its focus until it successfully identifies a small white dog — demonstrating adaptive reasoning and spatial navigation for fine-grained visual understanding.
  • ...and 2 more figures
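
The loop described in the Figure 2(a) caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `build_trajectory`, `propose`, `run_sandbox`, and `is_consistent` are hypothetical, the step limit is an arbitrary choice, and discarding a trajectory on the first inconsistent step is an assumption, since the caption does not say what happens when validation fails.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    reasoning: str  # model's explanation of why the image operation is needed
    code: str       # image-manipulation code generated for this step
    output: str     # what the sandbox returned after running the code


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


# Hypothetical model interface: returns (answer, reasoning, code).
# A non-None answer means a direct reply is possible.
Propose = Callable[[str, List[Step]], Tuple[Optional[str], str, str]]


def build_trajectory(
    question: str,
    propose: Propose,
    run_sandbox: Callable[[str], str],
    is_consistent: Callable[[str, str], bool],
    max_steps: int = 10,  # assumed limit, not taken from the paper
) -> Optional[Trajectory]:
    traj = Trajectory(question)
    for _ in range(max_steps):
        answer, reasoning, code = propose(question, traj.steps)
        if answer is not None:
            # Direct reply is possible: the trajectory is complete.
            traj.answer = answer
            return traj
        output = run_sandbox(code)
        if not is_consistent(reasoning, output):
            # Assumption: an inconsistent step discards the whole sample.
            return None
        # Consistent result is fed back for the next iteration.
        traj.steps.append(Step(reasoning, code, output))
    return None  # no answer within the step limit


# Toy stand-ins for the model, sandbox, and validator.
def toy_propose(question, steps):
    if steps:
        return steps[-1].output, "", ""
    return None, "crop the sign region to read it", "crop(0, 0, 10, 10)"


def toy_sandbox(code):
    return "STOP"


def toy_consistent(reasoning, output):
    return bool(output)


traj = build_trajectory(
    "What does the sign say?", toy_propose, toy_sandbox, toy_consistent
)
print(traj.answer, len(traj.steps))
```

With the toy stand-ins, the first iteration requests a crop, the sandbox output passes validation and is fed back, and the second iteration answers directly, so the trajectory ends with one recorded step.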