UniVS: Unified and Universal Video Segmentation with Prompts as Queries

Minghan Li, Shuai Li, Xindong Zhang, Lei Zhang

TL;DR

UniVS addresses the fragmented landscape of video segmentation by unifying category-specified and prompt-guided tasks under a single architecture. It treats prompts as queries: the prompt features of a target are averaged over previous frames to form its initial query, and a target-wise prompt cross-attention (ProCA) layer integrates the prompt features held in a memory pool that is updated frame by frame. Because the masks predicted in previous frames serve as visual prompts for later frames, heuristic inter-frame matching is no longer needed. The model comprises an Image Encoder, a Prompt Encoder, and a Unified Video Mask Decoder that processes prompts and images through ProCA, image cross-attention, and separated self-attention, enabling prompt-guided target segmentation across frames. Empirically, UniVS achieves competitive or state-of-the-art results on 10 VS benchmarks spanning six tasks (VIS, VSS, VPS, VOS, RefVOS, PVOS), including 92.3 mVC$_8$ on VSPW and 58.2 STQ on VIPSeg, while generalizing across both category-specified and prompt-specified tasks. These findings point to a practical, universal approach to video segmentation with potential for broader video-language integration and prompt-based tracking.

Abstract

Despite the recent advances in unified image segmentation (IS), developing a unified video segmentation (VS) model remains a challenge. This is mainly because generic category-specified VS tasks need to detect all objects and track them across consecutive frames, while prompt-guided VS tasks require re-identifying the target with visual/text prompts throughout the entire video, making it hard to handle the different tasks with the same architecture. We make an attempt to address these issues and present a novel unified VS architecture, namely UniVS, by using prompts as queries. UniVS averages the prompt features of the target from previous frames as its initial query to explicitly decode masks, and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate prompt features in the memory pool. By taking the predicted masks of entities from previous frames as their visual prompts, UniVS converts different VS tasks into prompt-guided target segmentation, eliminating the heuristic inter-frame matching process. Our framework not only unifies the different VS tasks but also naturally achieves universal training and testing, ensuring robust performance across different scenarios. UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks, covering video instance, semantic, panoptic, object, and referring segmentation tasks. Code can be found at https://github.com/MinghanLi/UniVS.
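
The two ideas named in the abstract, an initial query obtained by averaging a target's prompt features and a target-wise cross-attention over those features, can be illustrated with a minimal pure-Python sketch. This is not the released UniVS code: the function names `average_query` and `prompt_cross_attention`, the single attention head, the absence of learned projections, and the residual connection are all assumptions made for illustration.

```python
import math

def average_query(prompt_feats):
    """Initial query for one target: the mean of its stored prompt feature vectors."""
    n = len(prompt_feats)
    dim = len(prompt_feats[0])
    return [sum(f[d] for f in prompt_feats) / n for d in range(dim)]

def prompt_cross_attention(query, prompt_feats):
    """Single-head scaled dot-product attention of one query over the prompt
    features of the SAME target (hypothetical stand-in for a ProCA layer;
    learned key/value projections are omitted)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, f)) / scale for f in prompt_feats]
    # Numerically stable softmax over the prompt features.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    attended = [
        sum(w * f[d] for w, f in zip(weights, prompt_feats))
        for d in range(len(query))
    ]
    # Residual connection (an assumption, as in standard transformer decoders).
    refined = [q + a for q, a in zip(query, attended)]
    return refined, weights

# Toy memory pool for one target: three 2-D prompt features from earlier frames.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
init_q = average_query(memory)
refined, weights = prompt_cross_attention(init_q, memory)
```

Restricting the attention to one target's own prompt features is what makes the layer "target-wise" in this sketch: each query only ever sees the memory entries that belong to its target.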

Paper Structure

This paper contains 24 sections, 9 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Illustration of different video segmentation (VS) tasks. Category-specified VS includes VIS, VSS and VPS tasks, while prompt-specified VS consists of VOS, RefVOS and PVOS tasks. Please find more video demos on our project page https://sites.google.com/view/unified-video-seg-univs.
  • Figure 2: Comparison between the existing unified segmentation methods and ours. In existing methods for category-specified segmentation tasks (see (a1)), entities need to be first detected per frame and then matched across frames, while in methods for prompt-specified segmentation tasks (see (a2)), targets need to be identified from the predicted masks. In contrast, our proposed UniVS (see (b)) uses predicted masks as pseudo visual prompts and averages prompt features to decode masks across frames, avoiding heuristic post-processing.
  • Figure 3: Training process of our unified video segmentation (UniVS) framework. UniVS contains three main modules: the Image Encoder (grey rectangle), the Prompt Encoder (purple rectangle) and the Unified Video Mask Decoder (yellow rectangle). The Image Encoder transforms the input RGB images to the feature space and outputs image embeddings. Meanwhile, the Prompt Encoder translates the raw visual/text prompts into prompt embeddings. The Unified Video Mask Decoder explicitly decodes the masks for any entity or prompt-guided target in the input video by using prompts as queries (striped triangles, hexagons and circles).
  • Figure 4: Inference process of our UniVS on prompt-specified and category-specified video segmentation tasks, respectively.
  • Figure 5: Qualitative results of UniVS without and with ProCA on the VOS task.
  • ...and 5 more figures
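
The inference idea described in the captions of Figures 2 and 4, where masks predicted in earlier frames act as pseudo visual prompts for later frames, can be sketched as a frame-by-frame loop. This is a toy illustration under stated assumptions, not the UniVS implementation: `masked_average` (masked average pooling as a stand-in prompt encoder), `decode_mask` (a dot-product threshold as a stand-in mask decoder), and `segment_video` are hypothetical names, and each frame is represented as a flat list of per-pixel feature vectors.

```python
def masked_average(features, mask):
    """Average the per-pixel feature vectors where mask == 1.
    Stand-in prompt encoder; returns None for an empty mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def decode_mask(query, features):
    """Toy decoder: a pixel is foreground if its feature has a positive
    dot product with the query."""
    return [1 if sum(q * x for q, x in zip(query, f)) > 0 else 0 for f in features]

def segment_video(frames, first_mask):
    """Propagate one target through the video without inter-frame matching."""
    memory = [masked_average(frames[0], first_mask)]  # memory pool of prompts
    masks = [first_mask]
    for features in frames[1:]:
        dim = len(memory[0])
        # Initial query: average of all prompt features stored so far.
        query = [sum(p[d] for p in memory) / len(memory) for d in range(dim)]
        mask = decode_mask(query, features)
        # The predicted mask becomes a pseudo visual prompt for later frames.
        prompt = masked_average(features, mask)
        if prompt is not None:
            memory.append(prompt)
        masks.append(mask)
    return masks, memory

# Three frames of three "pixels" each, with 2-D features; the target has a
# positive first feature component and moves between pixel positions.
frames = [
    [[1.0, 0.2], [-1.0, 0.1], [0.9, -0.1]],
    [[-0.8, 0.0], [1.1, 0.1], [1.0, 0.0]],
    [[1.2, 0.0], [1.0, 0.1], [-0.9, 0.2]],
]
masks, memory = segment_video(frames, [1, 0, 1])
```

In this sketch the target's identity is carried entirely by the memory pool, so no per-frame detection followed by cross-frame association is required.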