DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong; Chenxiao Zhao; ChengLin Zhu; Weiheng Lu; Guohai Xu; Xing Yu

TL;DR

DeepEyesV2 tackles the challenge of agentic multimodal reasoning by enabling dynamic tool invocation (code execution and web search) within an iterative reasoning loop. Direct reinforcement learning struggles to induce robust tool use, motivating a two-stage training pipeline: a cold-start supervised phase to bootstrap tool-use patterns, followed by reinforcement learning to refine and adapt tool invocation. A carefully curated, diverse dataset supports the cold-start stage, and RealX-Bench provides a comprehensive benchmark to evaluate coordinated perception, search, and reasoning in real-world scenarios. Across real-world understanding, mathematical reasoning, and search-intensive tasks, DeepEyesV2 demonstrates task-adaptive tool use and improved performance, offering a practical path toward more reliable and explainable agentic multimodal systems.

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This observation motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
