Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation

Zhigang Cen, Ningyan Guo, Wenjing Xu, Zhiyong Feng, Danlan Huang

TL;DR

The paper tackles video semantic segmentation (VSS) by shifting focus from pixel-level temporal alignment to static-dynamic class-level perception. It introduces the SD-CPC framework, which combines multivariate class prototypes with contrastive learning (MCP-CL) and a static-dynamic semantic alignment module (SSEA and DSSA) that uses window-based attention to reduce computation. The approach constrains inter- and intra-class feature relationships while aggregating cross-frame information selectively in two stages, aiming for better segmentation accuracy and temporal consistency at lower computational cost. The authors report that the method outperforms state-of-the-art approaches on VSPW and Cityscapes, and state that the implementation will be open-sourced on GitHub.

Abstract

Video semantic segmentation (VSS) has been widely employed in many fields, such as simultaneous localization and mapping, autonomous driving, and surveillance. Its core challenge is how to leverage temporal information to achieve better segmentation. Previous efforts have primarily focused on pixel-level static-dynamic context matching, utilizing techniques such as optical flow and attention mechanisms. Instead, this paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency (SD-CPC) framework. In this framework, we propose a multivariate class prototype with contrastive learning and a static-dynamic semantic alignment module. The former provides class-level constraints for the model, obtaining personalized inter-class features and diversified intra-class features. The latter first establishes intra-frame spatial multi-scale and multi-level correlations to achieve static semantic alignment. Then, based on cross-frame static perceptual differences, it performs two-stage cross-frame selective aggregation to achieve dynamic semantic alignment. Meanwhile, we propose a window-based attention map calculation method that leverages the sparsity of attention points during cross-frame aggregation to reduce computational cost. Extensive experiments on the VSPW and Cityscapes datasets show that the proposed approach outperforms state-of-the-art methods. Our implementation will be open-sourced on GitHub.
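
To make the window-based idea concrete, the following is a minimal sketch of cross-frame attention restricted to a local window. It is an illustration under stated assumptions, not the paper's method: the function name `window_cross_attention`, the `radius` parameter, and the choice of a square window centered on the query's own coordinates are all hypothetical, and the abstract does not specify how the authors select attention points. The sketch only shows why a window reduces cost: each query scores at most `(2 * radius + 1) ** 2` reference-frame positions instead of every position in the frame.

```python
import math


def window_cross_attention(query_map, key_map, value_map, radius):
    """Attend from each current-frame location to a local reference-frame window.

    All maps are nested lists shaped [height][width][channels]. The square
    window of half-width `radius` is an assumption made for illustration.
    """
    height, width = len(query_map), len(query_map[0])
    scale = 1.0 / math.sqrt(len(query_map[0][0]))
    value_dim = len(value_map[0][0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            query = query_map[y][x]
            # In-bounds reference-frame coordinates inside the window.
            coords = [
                (ny, nx)
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # Scaled dot-product scores against the windowed keys only.
            scores = [
                scale * sum(q * k for q, k in zip(query, key_map[ny][nx]))
                for ny, nx in coords
            ]
            # Numerically stable softmax over the window.
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            # Weighted sum of the windowed values.
            aggregated = [0.0] * value_dim
            for weight, (ny, nx) in zip(weights, coords):
                for c in range(value_dim):
                    aggregated[c] += (weight / total) * value_map[ny][nx][c]
            row.append(aggregated)
        output.append(row)
    return output


# Toy 2x2 frames with 2-channel queries/keys and 1-channel values.
current = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
reference_values = [[[1.0], [2.0]], [[3.0], [4.0]]]
same_pixel_only = window_cross_attention(current, current, reference_values, 0)
```

With `radius=0` each query sees only the co-located reference pixel, so the output reproduces `reference_values`; larger radii blend neighbouring values. A practical implementation would vectorize this on a tensor library rather than loop over pixels.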

Paper Structure

This paper contains 11 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of the different methods. (a) The direct methods explicitly distort features based on pre-trained optical flow networks, resulting in inconsistent information. (b) The indirect methods model the relationship between all pixels with the attention mechanism, leading to extremely high computation cost. (c) The proposed method models the static-dynamic spatio-temporal associations at the category level, achieving more efficient and accurate results.
  • Figure 2: Overview of the proposed SD-CPC framework. First, we model the spatial relationships of the pixel features extracted by the backbone at multiple scales and levels, achieving static semantic alignment. Then, based on the cross-frame static semantic differences, we conduct two-stage dynamic semantic selective aggregation to achieve dynamic semantic alignment. During training, we obtain multivariate class prototypes from the prediction results and output features, and then combine them with contrastive learning to realize class-level constraints and improve the model's representation capability.
  • Figure 3: Qualitative results. We visually compare the proposed method with the baseline (SegFormer with the MiT-B1 backbone). From top to bottom: the input video frames, the predictions of SegFormer, our predictions, and the ground truth (GT). The proposed method generates better results than the baseline in terms of accuracy and video consistency (VC).
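
The class-level constraint described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' formulation: it assumes a single mean prototype per class (the paper's prototypes are multivariate, and how they are built is not specified here), cosine similarity, and an InfoNCE-style loss with a hypothetical `temperature` parameter. The function names `class_prototypes` and `prototype_contrastive_loss` are invented for this example.

```python
import math


def class_prototypes(features, labels):
    """Average the feature vectors of all pixels that share a class label."""
    sums, counts = {}, {}
    for vector, cls in zip(features, labels):
        accumulator = sums.setdefault(cls, [0.0] * len(vector))
        for i, value in enumerate(vector):
            accumulator[i] += value
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: [v / counts[cls] for v in acc] for cls, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def prototype_contrastive_loss(features, labels, prototypes, temperature=0.1):
    """InfoNCE-style loss: pull each pixel toward its own class prototype
    and push it away from the prototypes of other classes."""
    classes = sorted(prototypes)
    total = 0.0
    for vector, cls in zip(features, labels):
        logits = [cosine(vector, prototypes[c]) / temperature for c in classes]
        # Stable log-sum-exp for the softmax normalizer.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[classes.index(cls)]
    return total / len(features)


# Toy "pixels": two of class 0 along the first axis, one of class 1 along the second.
pixel_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
pixel_labels = [0, 0, 1]
prototypes = class_prototypes(pixel_features, pixel_labels)
aligned_loss = prototype_contrastive_loss(pixel_features, pixel_labels, prototypes)
swapped_loss = prototype_contrastive_loss(pixel_features, [1, 1, 0], prototypes)
```

In this toy setup the loss is near zero when pixels sit close to their own class prototype and grows when labels are swapped, which is the behaviour a class-level contrastive constraint is meant to encourage. In training, the labels would come from predictions or ground truth and the features from the network's output layer.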