One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou

TL;DR

This work tackles language-guided segmentation in videos by combining a multimodal language model with video-specific temporal reasoning. The core innovations are Sparse Dense Sampling, which balances dense spatial detail against temporal context, and One-Token-Seg-All, which uses a single <TRK> token to track and segment targets across frames. VideoLISA integrates a visual tokenizer, a vision encoder from SAM, and a Phi-3-based LLM (via LLaVA) to generate frame-level masks. It is trained on a combination of image and video data with a text loss $L_{txt}$ and a segmentation loss $L_{seg}$ (BCE and Dice). The approach achieves strong results on RVOS benchmarks and the new ReasonVOS benchmark, and it generalizes notably to image segmentation, indicating potential as a unified language-instructed segmentation foundation model for both video and image domains.
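
As a rough illustration of the segmentation loss $L_{seg}$ named above, the following sketch combines binary cross-entropy with a Dice term over a flattened mask. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the weights `w_bce` and `w_dice`, and the smoothing constant `eps` are all assumptions introduced here.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over per-pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)

def dice_loss(probs, targets, eps=1e-7):
    """One minus the soft Dice coefficient between prediction and target."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def seg_loss(probs, targets, w_bce=1.0, w_dice=1.0):
    """Weighted sum of BCE and Dice; the weights are hypothetical."""
    return w_bce * bce_loss(probs, targets) + w_dice * dice_loss(probs, targets)
```

BCE penalizes each pixel independently, while Dice measures overlap of the whole mask, so the two terms are commonly paired when the target occupies only a small part of the frame.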

Abstract

We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image-based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video-LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One-Token-Seg-All approach using a specially designed <TRK> token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language-instructed object segmentation. Code and model will be available at: https://github.com/showlab/VideoLISA.
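
The trade-off behind Sparse Dense Sampling can be sketched as a token-budget plan. The code below is one plausible reading of the idea, not the paper's exact procedure: it assumes a set of uniformly sampled frames, of which a few keep their full token count ("dense") while the rest are pooled down to a handful of tokens ("sparse"). The function names and the token counts used in the usage note are hypothetical.

```python
def uniform_indices(num_items, k):
    """Pick k indices spread evenly across range(num_items)."""
    if k <= 0 or num_items <= 0:
        return []
    k = min(k, num_items)
    if k == 1:
        return [num_items // 2]
    # Integer arithmetic keeps the result deterministic and strictly increasing.
    return [(i * (num_items - 1)) // (k - 1) for i in range(k)]

def sparse_dense_plan(num_frames, t_sparse, t_dense, full_tokens, pooled_tokens):
    """Return (frame_index, token_count) pairs for the sampled frames.

    Assumption: t_dense of the t_sparse sampled frames keep full_tokens
    tokens; the remaining frames are pooled down to pooled_tokens tokens.
    """
    sampled = uniform_indices(num_frames, t_sparse)
    dense_positions = uniform_indices(len(sampled), t_dense)
    dense_frames = {sampled[i] for i in dense_positions}
    return [
        (idx, full_tokens if idx in dense_frames else pooled_tokens)
        for idx in sampled
    ]

def token_budget(plan):
    """Total number of visual tokens the LLM would receive."""
    return sum(count for _, count in plan)
```

With illustrative values, `sparse_dense_plan(32, 8, 2, 576, 4)` keeps two frames at 576 tokens and pools the other six to 4 tokens each, for 1176 tokens in total, compared with 4608 if all eight frames stayed dense.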

Paper Structure (30 sections, 7 figures, 10 tables)

This paper contains 30 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Framework of our approach.
  • Figure 2: Exploration of One-Token-Seg-All approach.
  • Figure 3: VideoLISA is a capable model for video object segmentation with versatile language-instructed reasoning abilities. Beyond basic language referring, it enables complex reasoning by leveraging world knowledge and video temporal dynamics.
  • Figure 4: Failure cases of VideoLISA.
  • Figure 5: ReasonVOS benchmark. The left part shows the statistics of data samples. The right part shows the source of the videos.
  • ...and 2 more figures