
Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search

Tianming Liang, Qirui Du, Jian-Fang Hu, Haichao Jiang, Zicheng Lin, Wei-Shi Zheng

TL;DR

Seg-ReSearch tackles open-world video segmentation by interleaving multi-turn reasoning with external web search to identify language-referred targets beyond the frozen knowledge of multimodal LLMs. It introduces a hierarchical reward design—comprising Initial Guidance Reward, Tapering Process Reward, and Outcome Reward—and trains with Group Relative Policy Optimization to balance exploration and task accuracy. A new OK-VOS benchmark is proposed to explicitly require outside knowledge, enabling rigorous evaluation of open-world visual reasoning. Empirical results show substantial gains over SOTA baselines on OK-VOS and new SOTA on ReasonSeg and ReasonVOS, highlighting the practical potential for open-world, knowledge-augmented visual agents.

Abstract

Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose Seg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg-ReSearch improves state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.
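
To make the training signal more concrete, the sketch below shows how a group-relative advantage, the core idea of Group Relative Policy Optimization, could be computed from per-rollout rewards. It is only an illustration under stated assumptions: this page does not give the actual definitions or weighting of the Initial Guidance Reward, Tapering Process Reward, and Outcome Reward, so the unweighted sum in `total_reward`, the example reward values, and the use of the population standard deviation are all hypothetical choices rather than the authors' implementation.

```python
from statistics import mean, pstdev


def total_reward(igr: float, tpr: float, outcome: float) -> float:
    """Combine the three reward levels into one scalar.

    Assumption: a plain unweighted sum. The real combination rule is not
    stated in this excerpt.
    """
    return igr + tpr + outcome


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own sampling group.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    are scored relative to their siblings rather than by a learned critic.
    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (IGR, TPR, outcome) values for four rollouts of the same query.
rollouts = [
    (1.0, 0.5, 1.0),
    (1.0, 0.2, 0.0),
    (0.0, 0.0, 0.0),
    (1.0, 0.8, 1.0),
]
rewards = [total_reward(*components) for components in rollouts]
advantages = group_relative_advantages(rewards)

for components, advantage in zip(rollouts, advantages):
    print(components, round(advantage, 3))
```

Because the advantages are centered on the group mean, rollouts that beat their siblings are reinforced and the rest are discouraged, which is why no separate value network is needed in this style of training.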

Paper Structure (18 sections, 7 equations, 6 figures, 10 tables)


Figures (6)

  • Figure 1: Through multi-turn interleaved reasoning and web search, our Seg-ReSearch is able to localize and segment any language-referred target in videos, even those involving new concepts or up-to-date information that lies beyond the internal knowledge of MLLMs.
  • Figure 2: In order to identify the target objects involving new information, Seg-ReSearch conducts multi-turn interactions with the external search engine throughout the dynamic Multi-modal Chain-of-Thought (MCoT). This capability is incentivized by a 3-level reward structure: IGR pilots the initial planning, TPR encourages extensive exploration, and OR ensures final task accuracy.
  • Figure 3: The growth curve of TPR with increasing action turns.
  • Figure 4: Left: number of search calls over training steps. Right: overall performance of various methods.
  • Figure 5: Qualitative comparison between our Seg-ReSearch and the baselines: Qwen3-VL-8B* and Qwen3-VL-8B*+Search.
  • ...and 1 more figure