VISA: Reasoning Video Object Segmentation via Large Language Models

Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, Efstratios Gavves

TL;DR

ReasonVOS (Reasoning Video Object Segmentation) is the task of segmenting objects in videos from implicit text queries that require world knowledge and video context. The authors introduce VISA, which combines a Text-guided Frame Sampler, a multi-modal LLM, and a SAM-based mask decoder to segment a target frame, then uses an Object Tracker to propagate that mask into a complete mask sequence; the model is trained with joint text-generation and segmentation losses. To support instruction tuning and evaluation, they present ReVOS, a benchmark of 35,074 instruction-mask sequence pairs from 1,042 videos. Across eight datasets, VISA achieves state-of-the-art performance on both reasoning and referring segmentation in the video and image domains.

Abstract

Existing Video Object Segmentation (VOS) relies on explicit user instructions, such as categories, masks, or short phrases, which restricts its ability to perform complex video segmentation requiring reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities based on world knowledge and video contexts. Such a capability is crucial for structured environment understanding and object-centric interactions, both pivotal in the development of embodied AI. To tackle ReasonVOS, we introduce VISA (Video-based large language Instructed Segmentation Assistant), which leverages the world knowledge reasoning capabilities of multi-modal LLMs while possessing the ability to segment and track objects in videos with a mask decoder. Moreover, we establish a comprehensive benchmark consisting of 35,074 instruction-mask sequence pairs from 1,042 diverse videos, which incorporates complex world knowledge reasoning into segmentation tasks for instruction tuning and evaluation of ReasonVOS models. Experiments conducted on 8 datasets demonstrate the effectiveness of VISA in tackling complex reasoning segmentation and vanilla referring segmentation in both video and image domains. The code and dataset are available at https://github.com/cilinyan/VISA.

Paper Structure (14 sections, 5 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: We enable the reasoning video object segmentation capabilities for current multi-modal LLMs. The proposed VISA is capable of segmenting and tracking objects given text descriptions involving: (a) complex reasoning of world knowledge; (b) inference of upcoming events; and (c) comprehensive understanding of video content.
  • Figure 2: Our proposed VISA consistently achieves state-of-the-art performance on video and image datasets across reasoning and referring segmentation tasks. $\mathcal{J}$ is region similarity [seo2020urvos], $\mathcal{F}$ is contour accuracy [seo2020urvos], and $\mathcal{R}$ is the robustness score [li2022r].
  • Figure 3: Overview of VISA. (a) Given a video $\mathbf{x}_v$ and a text description $\mathbf{x}_t$, a Text-guided Frame Sampler (TFS) samples the most distinguishing frame $f_{tgt}$ as the target to be segmented, together with the corresponding reference frames $\mathbf{x}_r$. (b) Then $f_{tgt}$, $\mathbf{x}_r$, and $\mathbf{x}_t$ are tokenized and fed to a Multi-Modal LLM to generate text output that includes a special token <Seg>. The last-layer embedding $h_{seg}$ of the <Seg> token is then decoded into the segmentation mask $m_{tgt}$ of frame $f_{tgt}$ via the mask decoder. (c) Finally, the segmentation masks of all frames $\mathcal{M}$ are generated by propagation with an Object Tracker. The modules in blue are frozen during training, while the modules in pink are trainable.
  • Figure 4: Visualizations of VISA on the ReVOS dataset.
  • Figure 5: Heatmaps of the target frame $f_{tgt}$. To draw the heatmap, we generate 10 responses with the Text-guided Frame Sampler (TFS) and obtain the normalized distribution. As shown, the highlighted frames are related to the text queries.
  • ...and 1 more figures
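
The data flow described in the Figure 3 caption, and the frame distribution described in the Figure 5 caption, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class VisaSketch, its four callable fields, the toy Frame/Mask types, and the function target_frame_distribution are all hypothetical names introduced here. The sketch only assumes what the captions state, namely that the sampler returns a target frame plus reference frames, that the LLM output contains a <Seg> token whose embedding is decoded into a mask, and that a tracker propagates that mask to every frame.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[List[int]]  # toy stand-in for an image (hypothetical type)
Mask = List[List[int]]   # toy stand-in for a binary mask (hypothetical type)


@dataclass
class VisaSketch:
    """Hypothetical wiring of the four stages named in the Figure 3 caption."""
    # (video, text) -> (target frame index, reference frame indices)
    frame_sampler: Callable[[Sequence[Frame], str], Tuple[int, List[int]]]
    # (f_tgt, x_r, text) -> (generated text, embedding of the <Seg> token)
    mllm: Callable[[Frame, List[Frame], str], Tuple[str, List[float]]]
    # (f_tgt, h_seg) -> mask of the target frame
    mask_decoder: Callable[[Frame, List[float]], Mask]
    # (video, target index, m_tgt) -> one mask per frame
    tracker: Callable[[Sequence[Frame], int, Mask], List[Mask]]

    def segment(self, video: Sequence[Frame], text: str) -> List[Mask]:
        # (a) choose the target frame and the reference frames
        tgt_idx, ref_idx = self.frame_sampler(video, text)
        f_tgt = video[tgt_idx]
        x_r = [video[i] for i in ref_idx]
        # (b) generate text; the <Seg> embedding drives the mask decoder
        answer, h_seg = self.mllm(f_tgt, x_r, text)
        if "<Seg>" not in answer:
            raise ValueError("LLM output contains no <Seg> token")
        m_tgt = self.mask_decoder(f_tgt, h_seg)
        # (c) propagate the target-frame mask to all frames
        return self.tracker(video, tgt_idx, m_tgt)


def target_frame_distribution(sampled: Sequence[int], num_frames: int) -> List[float]:
    """Normalize how often each frame index was picked as the target frame."""
    if not sampled:
        raise ValueError("need at least one sampled index")
    counts = Counter(sampled)
    return [counts.get(i, 0) / len(sampled) for i in range(num_frames)]


# Toy usage with stub components; every value below is made up.
video = [[[t]] for t in range(4)]  # four 1x1 "frames"
model = VisaSketch(
    frame_sampler=lambda v, t: (2, [0, 3]),
    mllm=lambda f, r, t: ("Sure, it is <Seg>.", [0.0]),
    mask_decoder=lambda f, h: [[1]],
    tracker=lambda v, i, m: [m for _ in v],
)
masks = model.segment(video, "the object most likely to fall")
heat = target_frame_distribution([2, 2, 1, 2, 3, 2, 2, 1, 2, 2], num_frames=4)
```

In the toy run, masks holds one mask per frame, and heat is the normalized distribution over ten sampled target indices, which is the quantity the Figure 5 caption says is drawn as a heatmap. Passing the stages in as callables keeps the sketch independent of any particular sampler, LLM, decoder, or tracker.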