Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching
Heng Liu, Guanghui Li, Mingqi Gao, Xiantong Zhen, Feng Zheng, Yang Wang
TL;DR
This work addresses few-shot referring video object segmentation, introducing FS-RVOS and its multi-object extension, FS-RVMOS. Both are built on a Transformer-based framework with two components: a Cross-Modal Affinity (CMA) module that fuses visual and linguistic cues, and an Instance Sequence Matching (ISM) mechanism that selects among multiple candidate object trajectories across frames. The authors contribute three new benchmarks (Mini-Ref-YouTube-VOS, Mini-Ref-SAIL-VOS, and Mini-MeViS) covering single- and multi-object scenarios under few-shot conditions, and report state-of-the-art performance together with robust cross-domain generalization. Ablation studies examine the contributions of CMA, the textual expressions, and ISM, and indicate that each improves accuracy and flexibility for real-world, multi-object video segmentation.
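To make the idea of selecting among object trajectories concrete, here is a minimal sketch of one plausible selection rule. It is not the authors' implementation: the function name `select_instance_sequences`, the per-frame score matrix `ref_scores`, the `threshold` and `top_k` parameters, and the averaging rule are all assumptions chosen for illustration. Each candidate instance sequence is assumed to carry a per-frame score for how well it matches the expression.

```python
from statistics import mean


def select_instance_sequences(ref_scores, threshold=0.5, top_k=None):
    """Pick candidate instance sequences that best match the expression.

    ref_scores[i][t] is a hypothetical score in [0, 1] saying how well
    candidate sequence i matches the referring expression at frame t.
    Returns candidate indices ordered from best to worst.
    """
    # Aggregate over time so the decision is made per sequence, not per frame.
    seq_scores = [mean(frames) for frames in ref_scores]
    ranked = sorted(range(len(seq_scores)),
                    key=lambda i: seq_scores[i], reverse=True)

    if top_k is not None:
        # Single-object setting: keep a fixed number of sequences (top_k=1).
        return ranked[:top_k]

    # Multi-object setting: keep every sequence above the threshold,
    # falling back to the single best one if none qualifies.
    kept = [i for i in ranked if seq_scores[i] >= threshold]
    return kept or ranked[:1]


if __name__ == "__main__":
    scores = [
        [0.9, 0.8, 0.7],  # candidate 0: consistently matches
        [0.2, 0.1, 0.3],  # candidate 1: rarely matches
        [0.6, 0.7, 0.5],  # candidate 2: moderate match
    ]
    print(select_instance_sequences(scores, top_k=1))
    print(select_instance_sequences(scores, threshold=0.5))
```

Scoring whole sequences rather than individual frames keeps the selected object identity consistent over the video, which is the property the ISM description emphasizes. The actual matching criterion in the paper may differ.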
Abstract
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy. The latter extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show that FS-RVOS and FS-RVMOS outperform prior state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
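As a rough illustration of how visual and linguistic features can be related through an affinity matrix, the sketch below uses generic scaled dot-product cross-attention in plain Python. This is a stand-in technique, not the paper's cross-modal affinity module, whose exact design is not given here. The names `cross_modal_affinity`, `visual`, and `text`, and the residual fusion step, are assumptions for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_affinity(visual, text):
    """Fuse visual tokens with text tokens via scaled dot-product attention.

    visual: list of visual feature vectors (one per spatial location).
    text:   list of word feature vectors with the same dimension.
    Returns (fused, affinity), where affinity[i][j] is the weight that
    visual token i places on word j.
    """
    d = len(text[0])
    scale = math.sqrt(d)
    fused, affinity = [], []
    for v in visual:
        # Similarity between this visual token and every word.
        logits = [sum(a * b for a, b in zip(v, w)) / scale for w in text]
        attn = softmax(logits)
        # Language context for this visual token: attention-weighted word sum.
        ctx = [sum(attn[j] * text[j][k] for j in range(len(text)))
               for k in range(d)]
        # Residual fusion: keep the visual feature and add language context.
        fused.append([vk + ck for vk, ck in zip(v, ctx)])
        affinity.append(attn)
    return fused, affinity


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    fused, affinity = cross_modal_affinity(visual, text)
    print(affinity)
    print(fused)
```

In a real model the features would be learned projections inside Transformer layers and the computation would be batched on tensors. The point here is only that each visual location receives a language-conditioned update weighted by its affinity to each word.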
