Towards Open-Vocabulary Video Semantic Segmentation
Xinhao Li, Yun Liu, Guolei Sun, Min Wu, Le Zhang, Ce Zhu
TL;DR
This work defines the Open Vocabulary Video Semantic Segmentation (OV-VSS) task and presents OV2VSS, a baseline that labels every pixel in a video from an open vocabulary of categories. OV2VSS combines three components: a Spatial-Temporal Context Fusion module for reasoning across consecutive frames, Random Frame Enhancement for long-range context drawn from elsewhere in the video, and a Video Text Encoding module for grounding category text in the video. A complexity analysis examines the computational cost of the design. Evaluations on the VSPW and Cityscapes datasets show zero-shot generalization to unseen categories and improvements over image-based open-vocabulary methods, both in quantitative results and in mask quality. The approach targets real-world scenarios such as autonomous driving and surveillance, where novel categories continuously emerge.
Abstract
Semantic segmentation in videos has been a focal point of recent research. However, existing models struggle when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, which aims to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. As a robust baseline for OV-VSS, we propose OV2VSS, which integrates a spatial-temporal fusion module that allows the model to exploit temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, which broadens the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on the VSPW and Cityscapes benchmark datasets highlight OV2VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate the effectiveness of OV2VSS, demonstrating improved semantic segmentation performance across diverse video datasets.
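
The abstract describes two sources of temporal context: consecutive frames, used by the spatial-temporal fusion module, and frames from elsewhere in the sequence, used by random frame enhancement. A minimal sketch of how reference frames for one target frame might be chosen is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_frame_indices` and the parameters `num_neighbors` and `min_gap` are hypothetical, and the abstract does not specify how many frames are used or how far apart they are.

```python
import random


def sample_frame_indices(target, num_frames, num_neighbors=2, min_gap=10, rng=None):
    """Pick reference frames for one target frame (hypothetical helper).

    Returns (neighbors, distant):
      neighbors: up to `num_neighbors` frame indices immediately before `target`
      distant:   one random frame index at least `min_gap` frames away from
                 `target`, or None if the clip is too short to contain one
    """
    if not 0 <= target < num_frames:
        raise ValueError("target must be a valid frame index")
    if rng is None:
        rng = random.Random()

    # Consecutive context: the frames just before the target.
    neighbors = list(range(max(0, target - num_neighbors), target))

    # Long-range context: any frame far enough from the target (assumed rule).
    candidates = [i for i in range(num_frames) if abs(i - target) >= min_gap]
    distant = rng.choice(candidates) if candidates else None
    return neighbors, distant
```

Calling `sample_frame_indices(50, 100)` yields the neighbors `[48, 49]` plus one randomly chosen index at least 10 frames from frame 50. Passing a seeded `random.Random` instance as `rng` makes the choice reproducible.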
