
StAR: Segment Anything Reasoner

Seokju Yun, Dongheon Lee, Noori Bae, Jaesung Jun, Chanseul Cho, Youngmin Ro

Abstract

As AI systems are integrated ever more rapidly into diverse and complex real-world environments, the ability to reason holistically over an implicit query and an image to localize a target is becoming increasingly important. However, recent reasoning segmentation methods fail to sufficiently elicit the visual reasoning capabilities of the base model. In this work, we present Segment Anything Reasoner (StAR), a comprehensive framework that refines the design space from multiple perspectives (parameter-tuning scheme, reward functions, learning strategies, and answer format) and achieves substantial improvements over recent baselines. In addition, for the first time, we successfully introduce parallel test-time scaling to the segmentation task, pushing the performance boundary even further. To extend the scope and depth of reasoning covered by existing benchmarks, we also construct ReasonSeg-X, which compactly defines reasoning types and includes samples that require deeper reasoning. Leveraging this dataset, we train StAR with a rollout-expanded selective-tuning approach to activate the base model's latent reasoning capabilities, and establish a rigorous benchmark for systematic, fine-grained evaluation of advanced methods. With only 5k training samples, StAR achieves significant gains over its base counterparts across extensive benchmarks, demonstrating that our method effectively brings dormant reasoning competence to the surface.

Paper Structure (34 sections, 8 equations, 16 figures, 9 tables)

Figures (16)

  • Figure 1: We establish a reasoning segmentation benchmark, ReasonSeg-X, that demands a broad range of reasoning skills. Our dataset addresses goal-oriented reasoning and the ability to flexibly invoke world knowledge, while also covering complex relational reasoning and step-by-step reasoning capabilities. Our model, StAR (equipped with Qwen3-VL 32B), demonstrates remarkable performance across all these aspects (right image). Additional results are provided in the supplementary material.
  • Figure 2: ReasonSeg-R examples. We correct masks for cases that are mismatched to the query and also refine mask quality. In addition, we modify query expressions that may inadvertently include regions outside the mask (e.g., the video player monitor).
  • Figure 3: Overview of the StAR framework pipeline. We pinpoint and resolve reasoning bottlenecks from all perspectives of the existing reasoning segmentation framework (highlighted in red text). See text for details.
  • Figure 4: (a) We progressively retrofit all pillars of the standard RLVR framework (VisionReasoner) toward our performant and robust model (StAR), without introducing significant computational overhead. The foreground bars show performance on ReasonSeg-X test, while results on ReasonSeg-R are shown with gray bars. The performance values in parentheses are obtained with our majority voting strategy (Sec. \ref{sec: mv}). Gray-marked design decisions (y-axis label texts) indicate comparison-only variants that are not adopted. (b) The proposed RLVR design space better preserves and exploits the base MLLM’s diverse reasoning capabilities (VisionReasoner $\rightarrow$ StAR stage-1). Moreover, Stage 2 training on the ReasonSeg-X train set with rollout-expanded selective-tuning (REST) effectively elicits the MLLM’s hidden reasoning potential, improving performance particularly on complex reasoning (e.g., C/MH). The resulting model, StAR, outperforms the baseline by a large margin across all reasoning types.
  • Figure 5: Motivation for the mask IoU reward. Left: Misalignment between the MLLM-level reward function and the task's final objective potentially induces confusing learning signals. Right: Distinguishing between diverse mask defects via a fine-grained tiered mask reward promotes stable and stepwise optimization.
  • ...and 11 more figures
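
The captions for Figures 4 and 5 mention a majority voting strategy and a tiered mask IoU reward without giving their definitions in this excerpt. The following is only a minimal sketch of what such components could look like, not the authors' implementation. The function names, the pixel-wise voting rule, and the tier thresholds and reward values are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks count as a perfect match (a convention assumed here).
    return inter / union if union else 1.0


def tiered_reward(
    iou: float,
    tiers: Sequence[Tuple[float, float]] = ((0.9, 1.0), (0.7, 0.6), (0.5, 0.3)),
) -> float:
    """Hypothetical tiered reward: return the reward of the highest tier
    whose IoU threshold is met. Thresholds and values are illustrative only."""
    for threshold, reward in tiers:
        if iou >= threshold:
            return reward
    return 0.0


def majority_vote(masks: List[Mask]) -> Mask:
    """Pixel-wise majority vote over several sampled masks (an assumed
    aggregation rule): keep a pixel if more than half of the masks mark it."""
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[y][x] for m in masks) > n else 0 for x in range(width)]
        for y in range(height)
    ]


# Toy usage: three sampled masks for one query, aggregated and scored.
target = [[1, 1], [0, 0]]
samples = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
]
voted = majority_vote(samples)
score = tiered_reward(mask_iou(voted, target))
```

In this toy case the individual samples disagree on two pixels, while the voted mask recovers the target exactly. This is the kind of gain that parallel sampling followed by aggregation is meant to provide, though the paper's actual strategy may differ.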