Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation
Zesen Cheng, Kehan Li, Hao Li, Peng Jin, Chang Liu, Xiawu Zheng, Rongrong Ji, Jie Chen
TL;DR
The paper tackles open-vocabulary video instance segmentation by introducing BriVIS, which models frame-level instance dynamics as a Brownian bridge and aligns the bridge center to class texts in an image-text space. The approach freezes a pretrained video segmentor and adds a Temporal Instance Resampler to capture temporal context, complemented by Bridge-Text Alignment, which uses multiple contrastive objectives to enforce bridge fidelity and text alignment. Empirical results on BURST and standard VIS benchmarks show BriVIS achieving state-of-the-art open-vocabulary performance and competitive closed-vocabulary results, indicating improved vocabulary generalization and temporal coherence. This dynamics-aware framework offers a practical path to robust open-vocabulary VIS in real-world video analysis.
Abstract
Temporally locating objects with arbitrary class texts is the primary pursuit of open-vocabulary Video Instance Segmentation (VIS). Because video data covers an insufficient vocabulary, previous methods leverage image-text pretrained models to recognize object instances by aligning each frame with class texts separately, ignoring the correlation between frames. This separation breaks the instance movement context of videos, causing inferior alignment between video and text. To tackle this issue, we propose to link frame-level instance representations as a Brownian bridge to model instance dynamics, and to align the bridge-level instance representation to class texts for more precise open-vocabulary VIS (BriVIS). Specifically, we build our system upon a frozen video segmentor to generate frame-level instance queries, and design a Temporal Instance Resampler (TIR) to generate queries with temporal context from the frame queries. To mold instance queries to follow a Brownian bridge and accomplish alignment with class texts, we design Bridge-Text Alignment (BTA) to learn discriminative bridge-level representations of instances via contrastive objectives. Setting MinVIS as the basic video segmentor, BriVIS surpasses the open-vocabulary SOTA (OV2Seg) by a clear margin. For example, on the challenging large-vocabulary VIS dataset (BURST), BriVIS achieves 7.43 mAP, a 49.49% relative improvement over OV2Seg (4.97 mAP).
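As a rough illustration of the Brownian bridge idea, the sketch below uses the textbook definition of a bridge pinned at two endpoints: at time t in [0, T] the state has mean (1 - t/T) z_0 + (t/T) z_T and per-dimension variance t(T - t)/T. It is not the BriVIS implementation. The function names, the use of the endpoint midpoint as the "bridge center", and the cosine-similarity classification are all assumptions made for illustration; the abstract does not specify the actual objectives or architecture.

```python
import math
import random


def bridge_mean(z_start, z_end, t, T):
    """Expected state at time t of a bridge pinned at z_start (t=0) and z_end (t=T)."""
    w = t / T
    return [(1 - w) * a + w * b for a, b in zip(z_start, z_end)]


def bridge_variance(t, T):
    """Per-dimension variance of a standard Brownian bridge at time t (zero at both ends)."""
    return t * (T - t) / T


def sample_bridge(z_start, z_end, T, rng):
    """Draw one trajectory z_0..z_T by sampling each step conditioned on the previous state and z_end."""
    path = [list(z_start)]
    for t in range(1, T):
        prev = path[-1]
        remaining = T - (t - 1)  # steps left from the previous state to the pinned end
        mean = [p + (e - p) / remaining for p, e in zip(prev, z_end)]
        std = math.sqrt((remaining - 1) / remaining)
        path.append([m + std * rng.gauss(0.0, 1.0) for m in mean])
    path.append(list(z_end))
    return path


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def classify(center, text_embeddings):
    """Hypothetical helper: pick the class whose text embedding is most similar to the bridge center."""
    return max(text_embeddings, key=lambda name: cosine(center, text_embeddings[name]))


# Usage: treat the first and last frame-level instance embeddings as bridge endpoints.
head, tail = [0.9, 0.1], [1.1, 0.3]
center = bridge_mean(head, tail, t=1, T=2)  # midpoint of the endpoints (assumed "bridge center")
label = classify(center, {"cat": [1.0, 0.0], "dog": [0.0, 1.0]})
```

The variance term vanishes at both endpoints and peaks midway, so intermediate frames can deviate from the straight line between the first and last embeddings while the trajectory as a whole stays anchored.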
