Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

Heng Liu, Guanghui Li, Mingqi Gao, Xiantong Zhen, Feng Zheng, Yang Wang

TL;DR

This work targets the problem of few-shot referring video segmentation, introducing FS-RVOS and its multi-object extension FS-RVMOS. It proposes a Cross-Modal Affinity (CMA) module to fuse visual and linguistic cues and an Instance Sequence Matching (ISM) mechanism to select among multiple object trajectories across frames, all within a Transformer-based framework. The authors contribute three new benchmarks—Mini-Ref-YouTube-VOS, Mini-Ref-SAIL-VOS, and Mini-MeViS—to evaluate single- and multi-object scenarios under few-shot conditions, and demonstrate state-of-the-art performance with robust cross-domain generalization. Ablation studies validate the effectiveness of CMA, textual expressions, and ISM, highlighting improvements in both accuracy and flexibility for real-world, multi-object video segmentation.

Abstract

Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy, which extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.

Paper Structure

This paper contains 30 sections, 8 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of Few-shot RVOS and RVOS. (a) In RVOS, the training and testing sets overlap. (b) In Few-shot RVOS, the training and testing sets are disjoint. Different shapes represent different classes. (c) Few-shot RVOS segments the referred object in the video that belongs to the same class as the support set.
  • Figure 2: The overall pipeline of our framework. The feature encoder extracts visual and textual information from the support and query sets. The cross-modal affinity module calculates the multi-modal information affinity between the support set and the query set. Then, the fused features are enhanced by the transformer and used to obtain a series of segmentation masks through the kernel head. At the same time, the fused features are further refined across different scales by the feature pyramid network and used to obtain the referring matching scores through the referring head. Finally, the multi-object segmentation result is obtained through the instance sequence matching process.
  • Figure 3: The architecture of the Cross-modal Affinity (CMA) module. We use multi-head cross-attention to fuse visual and text features to obtain more robust features. Self-affinity models contextual information within the query features, and cross-affinity aggregates beneficial information from the support features.
  • Figure 4: The schematic diagram of ISM for referring video multi-object segmentation.
  • Figure 5: Annotation examples of the Mini-Ref-SAIL-VOS dataset.
  • ...and 5 more figures
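
The cross-affinity step described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections, and operates on plain Python lists. The function name `cross_affinity` and the toy feature shapes are hypothetical and chosen only for illustration. The idea shown is that each query-set feature attends over the support-set features and aggregates them with softmax-normalized affinity weights.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_affinity(query_feats, support_feats):
    """Aggregate support features for each query feature.

    Assumption (not from the paper): single head, no learned projections.
    query_feats:   list of Nq vectors, each of length d
    support_feats: list of Ns vectors, each of length d
    Returns (aggregated, weights), where weights[i][j] is the affinity
    of query token i to support token j.
    """
    d = len(query_feats[0])
    scale = math.sqrt(d)  # scaled dot-product, as in standard attention
    weights = []
    for q in query_feats:
        logits = [
            sum(qi * si for qi, si in zip(q, s)) / scale
            for s in support_feats
        ]
        weights.append(softmax(logits))
    # Weighted sum of support vectors for every query token.
    aggregated = [
        [sum(w * s[j] for w, s in zip(row, support_feats)) for j in range(d)]
        for row in weights
    ]
    return aggregated, weights


# Toy example: two query tokens, two support tokens, feature size 2.
query = [[1.0, 0.0], [0.0, 1.0]]
support = [[10.0, 0.0], [0.0, 10.0]]
aggregated, weights = cross_affinity(query, support)
```

In this toy input, each query token is most similar to one support token, so its aggregated feature is dominated by that token. A full CMA module, as the caption describes, would additionally fuse text features and apply self-affinity on the query features.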