Towards Open-Vocabulary Video Semantic Segmentation

Xinhao Li, Yun Liu, Guolei Sun, Min Wu, Le Zhang, Ce Zhu

TL;DR

This work defines the Open Vocabulary Video Semantic Segmentation (OV-VSS) task and presents OV2VSS, a baseline that combines spatial-temporal fusion, random long-range frame enhancement, and video-aware text encoding to label every pixel across an open vocabulary of categories. The Spatial-Temporal Context Fusion module, the Random Frame Enhancement module, and the Video Text Encoding module support cross-frame reasoning and textual grounding, and the paper includes a complexity analysis of the design. Evaluations on the VSPW and Cityscapes datasets show zero-shot generalization to unseen categories and improvements over image-based open-vocabulary methods, both in quantitative results and in mask quality. The approach aims to advance open-vocabulary video understanding with an end-to-end framework for real-world scenarios such as autonomous driving and surveillance, where novel categories continuously emerge.

Abstract

Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV2VSS's zero-shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.

Paper Structure

This paper contains 18 sections, 15 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison of VSS and OV-VSS. In traditional VSS (upper), the model is trained on a closed set of classes (e.g., "person" and "door") and fails to segment novel classes (e.g., "horse," as shown in the figure). In contrast, our open-vocabulary segmentation model (lower) is trained on base classes but can segment both base and novel categories.
  • Figure 2: Comparison between our method and image-based methods on the VSPW dataset. From top to bottom: the original frame, the image-based method, our method, and the ground truth. Image-based methods are prone to producing discontinuities in the predicted masks.
  • Figure 3: Overall structure of OV2VSS. Our approach primarily leverages three modules: the spatial-temporal information fusion module, which integrates spatial-temporal details from the video; the random frame enhancement module, which acquires contextual information from a randomly selected frame; and the video text encoding module, which uses text supervision during training. A denotes attention, + denotes element-wise addition, S denotes cost-volume construction, and C denotes concatenation. The text encoder and image encoder are obtained from CLIP. Detailed explanations are given in the methods section.
  • Figure 4: The architecture of the Spatial-Temporal Context Fusion module, which consists of P2T pooling, cross-attention, and cross-scale aggregation.
  • Figure 5: The architecture of the Random Frame Enhancement module, which uses cross-attention to aggregate contextual information from the randomly selected frame.
  • ...and 2 more figures
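
The captions for Figures 3 and 5 mention two operations that are easier to follow with a concrete example: cross-attention from a target frame to a randomly selected frame, and a cost volume between pixel and text embeddings. The following is a minimal NumPy sketch of both ideas, not the authors' implementation. The function names (`cross_attend`, `cost_volume`), the random stand-in features, the absence of learned projections, the residual addition, and the choice to exclude the target frame when sampling are all assumptions made for illustration; in OV2VSS the features come from the CLIP image and text encoders.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(target_feats, reference_feats):
    """Single-head cross-attention: target-frame pixels query a reference frame.

    Both inputs have shape (H, W, D). No learned projections are used
    (an illustrative simplification); the attended context is added
    back to the target features as a residual.
    """
    h, w, d = target_feats.shape
    queries = target_feats.reshape(-1, d)        # (H*W, D)
    keys_values = reference_feats.reshape(-1, d)  # (H*W, D)
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return (queries + attn @ keys_values).reshape(h, w, d)


def cost_volume(pixel_feats, text_feats):
    """Cosine similarity between every pixel and every class-name embedding.

    pixel_feats: (H, W, D); text_feats: (K, D). Returns (H, W, K).
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return p @ t.T


rng = np.random.default_rng(0)
T, H, W, D, K = 5, 4, 6, 8, 3  # frames, height, width, channels, class names

# Hypothetical stand-ins for image-encoder and text-encoder outputs.
video_feats = rng.normal(size=(T, H, W, D))
text_feats = rng.normal(size=(K, D))

target_idx = 2
# Assumption: sample the reference frame uniformly from the other frames.
candidates = [i for i in range(T) if i != target_idx]
ref_idx = int(rng.choice(candidates))

enhanced = cross_attend(video_feats[target_idx], video_feats[ref_idx])
scores = cost_volume(enhanced, text_feats)  # (H, W, K) similarity scores
labels = scores.argmax(axis=-1)             # (H, W) per-pixel class index
```

Because the class list enters only through `text_feats`, changing the vocabulary at inference time amounts to encoding a different set of class names, which is what makes this style of pixel-to-text matching open-vocabulary.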