DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu

TL;DR

DeepEyesV2 tackles the challenge of agentic multimodal reasoning by enabling dynamic tool invocation (code execution and web search) within an iterative reasoning loop. Direct reinforcement learning struggles to induce robust tool use, motivating a two-stage training pipeline: a cold-start supervised phase to bootstrap tool-use patterns, followed by reinforcement learning to refine and adapt tool invocation. A carefully curated, diverse dataset supports the cold-start stage, and RealX-Bench provides a comprehensive benchmark to evaluate coordinated perception, search, and reasoning in real-world scenarios. Across real-world understanding, mathematical reasoning, and search-intensive tasks, DeepEyesV2 demonstrates task-adaptive tool use and improved performance, offering a practical path toward more reliable and explainable agentic multimodal systems.

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
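
The iterative, tool-augmented reasoning described above can be illustrated with a minimal sketch. This is not the DeepEyesV2 implementation: the step format, the `run_agent` function, and the toy policy and tool are all hypothetical names chosen for illustration, and a real system would replace the toy policy with a multimodal model and the toy tool with a sandboxed code executor or a search API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step format: (kind, payload), where kind is a tool name
# such as "code" or "search", or the special kind "answer".
Step = Tuple[str, str]


def run_agent(policy: Callable[[List[str]], Step],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_turns: int = 4) -> Tuple[str, List[str]]:
    """Alternate between policy steps and tool calls until an answer appears."""
    context = [f"question: {question}"]
    for _ in range(max_turns):
        kind, payload = policy(context)
        if kind == "answer":
            return payload, context
        # Run the requested tool; unknown tools yield an error observation.
        tool = tools.get(kind)
        observation = tool(payload) if tool else f"unknown tool '{kind}'"
        # Feed the call and its result back into the context for the next step.
        context.append(f"{kind} call: {payload}")
        context.append(f"observation: {observation}")
    return "no answer within turn budget", context


# Toy stand-ins (assumptions for illustration only).
def toy_policy(context: List[str]) -> Step:
    # First turn: request a computation. Afterwards: answer with the result.
    if not any(line.startswith("observation:") for line in context):
        return ("code", "17 * 23")
    return ("answer", context[-1].split(": ", 1)[1])


def toy_code_tool(expression: str) -> str:
    # Handles only "a * b" products; a real tool would run sandboxed code.
    left, right = expression.split("*")
    return str(int(left) * int(right))


answer, trace = run_agent(toy_policy, {"code": toy_code_tool}, "What is 17 * 23?")
print(answer)
```

The point of the sketch is that tool outputs are appended to the context, so each later step can condition on earlier observations. That feedback is what makes the loop iterative rather than a single tool call.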

Paper Structure

This paper contains 24 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Illustration of agentic multimodal models. (a) Existing models perform unsatisfactorily in real-world scenarios, with clear limitations especially when perception, reasoning, and search must be tightly integrated. (b) A multi-step visual reasoning example requiring coordinated perception, search, and reasoning.
  • Figure 2: Case reasoning trajectory of DeepEyesV2. DeepEyesV2 seamlessly integrates code execution and web search within its iterative reasoning process. Notably, in the right case, the behavior of accessing webpages via code does not exist in cold start data and is spontaneously acquired during reinforcement learning.
  • Figure 3: Pipeline of DeepEyesV2. DeepEyesV2 invokes tools and incorporates execution results into subsequent reasoning steps, enabling iterative and tool-augmented multimodal inference.
  • Figure 4: Pioneer experiments reveal that existing multimodal models cannot directly acquire reliable tool-use ability through RL, demonstrating the necessity of a cold-start phase. The red dashed line represents the number of tool calls in a single rollout, and the blue solid line represents the average response length.
  • Figure 5: Statistics of RealX-Bench. (a) Domain distribution across five representative categories: Daily Life, Media, Sports, Knowledge, and Games. (b) Distribution of subsets classified by required abilities: perception, reasoning, search, and integration. These numbers may overlap because the challenges are not mutually exclusive. Integration denotes questions that are difficult across all three abilities simultaneously.
  • ...and 7 more figures