Reasoning Segmentation for Images and Videos: A Survey

Yiqing Shen, Chenjia Li, Fei Xiong, Jeong-O Jeong, Tianpeng Wang, Michael Latman, Mathias Unberath

TL;DR

Reasoning Segmentation defines a new task that produces pixel-level masks from implicit text queries, requiring joint visual understanding and world knowledge. The survey catalogs 26 image and video RS methods that fuse multimodal language models with segmentation backbones, including end-to-end, decoupled reasoning, and conversational architectures, many leveraging SAM, DINOv2, and LoRA. It also reviews 29 image and 10 video RS datasets and benchmarks, plus a range of evaluation metrics that emphasize both mask quality and text-based reasoning, while highlighting gaps in multi-step reasoning, robustness, and domain-specific applications. The work identifies challenges such as dependence on external segmentation models, heavy computational demands, and a lack of standardized reasoning-focused metrics, and outlines directions toward expanded reasoning, richer evaluation, and broader modality integration with practical impact across safety, surveillance, healthcare, and autonomous systems.

Abstract

Reasoning Segmentation (RS) aims to delineate objects based on implicit text queries, the interpretation of which requires reasoning and knowledge integration. Unlike the traditional formulation of segmentation problems that relies on fixed semantic categories or explicit prompting, RS bridges the gap between visual perception and human-like reasoning capabilities, facilitating more intuitive human-AI interaction through natural language. Our work presents the first comprehensive survey of RS for image and video processing, examining 26 state-of-the-art methods together with a review of the corresponding evaluation metrics, as well as 29 datasets and benchmarks. We also explore existing applications of RS across diverse domains and identify their potential extensions. Finally, we identify current research gaps and highlight promising future directions.

Paper Structure

This paper contains 98 sections, 22 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustration of various segmentation tasks on a street-crossing scene. (a) Original image of the scene. (b) Semantic segmentation, where each pixel is assigned to a predefined category (e.g., person, road, vehicle) without distinguishing individual instances. (c) Instance segmentation, which differentiates individual objects within the same semantic category (e.g., multiple pedestrians). (d) Panoptic segmentation, which combines countable foreground objects with uncountable background elements. The upper text bubble shows a "referring query" with a direct descriptive expression ("The woman on the pedestrian crossing"), while the lower bubble presents a "reasoning query" requiring multi-step inference ("Which pedestrian walking just behind the man, with short silvery white hair, a navy sweater, cream trousers, and carrying a small dark handbag in her left hand"). Panels (e) and (f) show the resulting segmentation masks for the referring and reasoning queries, highlighting how reasoning segmentation can handle more complex, implicit descriptions beyond directly observable attributes.
  • Figure 2: Timeline of RS methods from October 2023 to April 2025, illustrating the field's rapid evolution. Blue and purple labels distinguish image-based and video-based RS approaches, respectively. The timeline shows the sequential development of these methods, beginning with LISA as the pioneering work, followed by numerous innovations that expand the capabilities of RS across both modalities.
  • Figure 3: Overview of LISA, which adopts an "embedding-as-mask" paradigm for RS.
  • Figure 4: Comparison of training approaches for RS models: (a) supervised fine-tuning, as in LISA and its variants, and (b) reinforcement learning, which optimizes the model using a reward function that combines format rewards and accuracy rewards, as implemented in Seg-Zero.
  • Figure 5: Illustrative example of image RS data.
  • ...and 2 more figures