Instance Brownian Bridge as Texts for Open-vocabulary Video Instance Segmentation

Zesen Cheng, Kehan Li, Hao Li, Peng Jin, Chang Liu, Xiawu Zheng, Rongrong Ji, Jie Chen

TL;DR

The paper tackles open-vocabulary video instance segmentation by introducing BriVIS, which models frame-level instance dynamics as a Brownian bridge and aligns the bridge center to class texts in an image-text space. The approach freezes a pretrained video segmentor and adds a Temporal Instance Resampler to capture temporal context, complemented by Bridge-Text Alignment, which uses multiple contrastive objectives to enforce bridge fidelity and text alignment. Empirical results on BURST and standard VIS benchmarks show BriVIS achieving state-of-the-art open-vocabulary performance and competitive closed-vocabulary results, highlighting improved vocabulary generalization and temporal coherence. This dynamics-aware framework offers a practical path to robust open-vocabulary VIS in real-world video analysis tasks.

Abstract

Temporally locating objects with arbitrary class texts is the primary pursuit of open-vocabulary Video Instance Segmentation (VIS). Because video data has an insufficient vocabulary, previous methods leverage image-text pretrained models to recognize object instances by aligning each frame with class texts separately, ignoring the correlation between frames. This separation breaks the instance movement context of videos and causes inferior alignment between video and text. To tackle this issue, we propose to link frame-level instance representations as a Brownian bridge to model instance dynamics, and to align the bridge-level instance representation to class texts for more precise open-vocabulary VIS (BriVIS). Specifically, we build our system upon a frozen video segmentor that generates frame-level instance queries, and design a Temporal Instance Resampler (TIR) to generate queries with temporal context from the frame queries. To mold instance queries to follow a Brownian bridge and to align them with class texts, we design Bridge-Text Alignment (BTA), which learns discriminative bridge-level representations of instances via contrastive objectives. With MinVIS as the base video segmentor, BriVIS surpasses the open-vocabulary SOTA (OV2Seg) by a clear margin. For example, on the challenging large-vocabulary VIS dataset BURST, BriVIS achieves 7.43 mAP, a 49.49% improvement over OV2Seg (4.97 mAP).
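To make the Brownian bridge idea in the abstract concrete, the following is a minimal sketch of a standard Brownian bridge pinned at a head feature (first frame) and a tail feature (last frame). It is not the authors' implementation. The function names, the use of integer frame indices as time, the unit noise scale, and the choice of the head-tail midpoint as the "bridge center" are all assumptions made for illustration.

```python
import math
import random


def bridge_mean(head, tail, t, T):
    """Mean of a bridge pinned at `head` (time 0) and `tail` (time T)."""
    w = t / T
    return [(1.0 - w) * h + w * e for h, e in zip(head, tail)]


def bridge_variance(t, T):
    """Per-dimension variance of a standard Brownian bridge: t (T - t) / T."""
    return t * (T - t) / T


def bridge_center(head, tail):
    """Assumed bridge center: the midpoint of head and tail."""
    return [(h + e) / 2.0 for h, e in zip(head, tail)]


def sample_bridge_path(head, tail, T, rng):
    """Sample one path x_0..x_T step by step.

    Given x_t, the next point of a bridge ending at `tail` is Gaussian with
    mean x_t + (tail - x_t) / (T - t) and variance (T - t - 1) / (T - t).
    """
    path = [list(head)]
    for t in range(T):
        remaining = T - t
        std = math.sqrt((remaining - 1) / remaining)
        prev = path[-1]
        path.append([
            x + (b - x) / remaining + std * rng.gauss(0.0, 1.0)
            for x, b in zip(prev, tail)
        ])
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    head, tail = [0.0, 0.0], [2.0, 4.0]
    for t, point in enumerate(sample_bridge_path(head, tail, 4, rng)):
        print(t, [round(v, 3) for v in point])
    print("center:", bridge_center(head, tail))
```

The variance is zero at both ends and largest in the middle, so intermediate frame features may wander while the first and last frames stay pinned.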

Paper Structure

This paper contains 23 sections, 14 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: The mechanical difference between (a) frame-text and (b) bridge-text (ours) alignment. Because video data has a deficient vocabulary, OVVIS adopts an image-text VLP model to provide the semantic space. Previous methods recognize instances by integrating frame-text alignment results. Our method links frame-level instance features as a Brownian bridge and aligns the bridge center to class texts, so that instance movement information is considered when recognizing video instances. Yellow circles, green circles, and the diamond denote frame-level instance features, class text features, and the bridge center, respectively. $\odot$ denotes calculating the alignment score.
  • Figure 2: The overall pipeline of BriVIS. Our main designs are TIR and BTA. The former regenerates instance queries by building connections between otherwise independent instance queries. The latter links instances spanning different frames as a Brownian bridge and aligns them with text at bridge granularity. Learning BTA requires a sufficient number of sampled frames, which is computationally expensive. Therefore, we split training into two stages: first pretraining the video segmentor, then training TIR and BTA.
  • Figure 3: Temporal Instance Resampler. The resampler contains inter-frame and intra-frame modules. The former captures long-range and short-range temporal context information. The latter regenerates instance queries using this temporal context. With the regenerated instance queries $\boldsymbol{\mathcal{Q}}$, we can obtain new segmentation masks $\boldsymbol{\mathcal{M}}$ with better temporal consistency. TIR is repeated $L$ times. In the $l$-th repetition, TIR uses $\boldsymbol{\mathcal{F}}^{o_{[l\%3]}}$ as the memory for the cross-attention of the intra-frame module.
  • Figure 4: Bridge-Text Alignment. BTA serves as a training mechanism and can be divided into two steps: (1) Linking instances spanning multiple frames as a Brownian bridge; (2) Aligning instance Brownian bridge to class text. The first step is implemented via (a) Head-Tail Matching & Bridge Contrastive losses. The second step is achieved by (b) Bridge-Text Contrastive loss.
  • Figure 5: Correlation between IoU and video instance length for (a) the baseline and (b) BriVIS (ours), computed on the YouTube-VIS 2019 train split. Darker areas indicate more samples with the corresponding IoU value and video instance length.
  • ...and 3 more figures
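
The contrast drawn in the Figure 1 caption can be sketched as follows. This is an illustrative toy, not the paper's code: it assumes cosine similarity as the alignment score, assumes that "integrating" frame-text results means averaging per-frame scores, and assumes the bridge center is the midpoint of the first and last frame features. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frame_text_scores(frame_feats, text_feats):
    """(a) Align every frame to every class text, then average over frames."""
    return [
        sum(cosine(f, t) for f in frame_feats) / len(frame_feats)
        for t in text_feats
    ]


def bridge_text_scores(frame_feats, text_feats):
    """(b) Align a single bridge-level feature (assumed midpoint of head and
    tail) to every class text."""
    head, tail = frame_feats[0], frame_feats[-1]
    center = [(h + e) / 2.0 for h, e in zip(head, tail)]
    return [cosine(center, t) for t in text_feats]


if __name__ == "__main__":
    # Toy 2-D features: the instance moves from one axis to the other.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 1.0], [1.0, 0.0]]  # two hypothetical class text features
    print("frame-text :", frame_text_scores(frames, texts))
    print("bridge-text:", bridge_text_scores(frames, texts))
```

In this toy case the bridge-level feature matches the first class text almost perfectly, while the averaged per-frame scores for that class are noticeably lower. That gap is the kind of effect the caption attributes to taking instance movement into account.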