Deep Common Feature Mining for Efficient Video Semantic Segmentation

Yaoyan Zheng, Hongyu Yang, Di Huang

TL;DR

This work addresses the efficiency gap in video semantic segmentation by introducing Deep Common Feature Mining (DCFM), which decouples backbone features into a reusable deep common representation and a frame-specific independent component. By pairing a lightweight feature fusion module with a symmetric training strategy and a self-supervised consistency loss, DCFM enables direct reuse of high-level information across frames while preserving per-frame details, yielding fast non-keyframe inference without sacrificing accuracy. The approach demonstrates strong speed–accuracy trade-offs on VSPW, Cityscapes, and CamVid, including substantial non-keyframe speedups and improved temporal consistency, and is supported by ablations that underscore the importance of feature decomposition and the consistency loss. Together, these contributions offer a robust, scalable solution for practical VSS deployment in high-frame-rate or resource-constrained scenarios.

Abstract

Recent advancements in video semantic segmentation have made substantial progress by exploiting temporal correlations. Nevertheless, persistent challenges, including redundant computation and the reliability of the feature propagation process, underscore the need for further innovation. In response, we present Deep Common Feature Mining (DCFM), a novel approach strategically designed to address these challenges by leveraging the concept of feature sharing. DCFM explicitly decomposes features into two complementary components. The common representation extracted from a key-frame furnishes essential high-level information to neighboring non-key frames, allowing for direct re-utilization without feature propagation. Simultaneously, the independent feature, derived from each video frame, captures rapidly changing information, providing frame-specific clues crucial for segmentation. To achieve such decomposition, we employ a symmetric training strategy tailored for sparsely annotated data, empowering the backbone to learn a robust high-level representation enriched with common information. Additionally, we incorporate a self-supervised loss function to reinforce intra-class feature similarity and enhance temporal consistency. Experimental evaluations on the VSPW and Cityscapes datasets demonstrate the effectiveness of our method, showing a superior balance between accuracy and efficiency. The implementation is available at https://github.com/BUAAHugeGun/DCFM.
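
The reuse scheme described in the abstract can be sketched as a simple scheduling loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `independent_stage`, `common_stage`, and `fuse` are hypothetical stand-ins for the per-frame stage of the backbone, the deep stage that produces the common representation, and the feature fusion module. The sketch also assumes that the deep stage consumes the output of the per-frame stage and that a keyframe is taken every `keyframe_interval` frames.

```python
from typing import Any, Callable, List, Sequence, Tuple


def segment_video(
    frames: Sequence[Any],
    keyframe_interval: int,
    independent_stage: Callable[[Any], Any],  # hypothetical: runs on every frame
    common_stage: Callable[[Any], Any],       # hypothetical: runs on keyframes only
    fuse: Callable[[Any, Any], Any],          # hypothetical: stands in for the fusion module
) -> Tuple[List[Any], int]:
    """Return per-frame outputs and the number of times the deep stage ran."""
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be at least 1")

    outputs: List[Any] = []
    cached_common = None  # common representation of the most recent keyframe
    common_calls = 0

    for t, frame in enumerate(frames):
        # The frame-specific (independent) feature is computed for every frame.
        f_independent = independent_stage(frame)

        # Assumption: every keyframe_interval-th frame is a keyframe.
        if t % keyframe_interval == 0:
            cached_common = common_stage(f_independent)
            common_calls += 1

        # Non-key frames reuse the cached common feature directly,
        # with no propagation or warping step in between.
        outputs.append(fuse(cached_common, f_independent))

    return outputs, common_calls
```

With seven frames and an interval of 3, the deep stage in this sketch runs only on frames 0, 3, and 6, while the per-frame stage and the fusion step run on all seven frames.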

Paper Structure

This paper contains 16 sections, 8 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: Speed-accuracy trade-off of semantic segmentation methods using MiT backbones on the VSPW validation set. The methods compared are SegFormer (image baseline), CFFM, MRCFA, and our proposed method, DCFM. By adjusting the keyframe interval $K$ during inference, DCFM reaches 80 FPS at $K=10$ while maintaining 46% mIoU.
  • Figure 2: Video semantic segmentation pipelines. (a) Applying an image segmentation model to each frame independently; (b) Aggregating features from multiple frames to predict the segmentation map for a target frame, enhancing accuracy; (c) Holistically propagating the feature of a keyframe to subsequent non-key frames for reuse, improving efficiency; and (d) The proposed method decomposes the feature into two complementary parts, enabling direct reuse of the common part by non-key frames without re-calibration, improving efficiency.
  • Figure 3: Overview of the Deep Common Feature Mining (DCFM) approach. (a) The complete network architecture for video segmentation during inference. The backbone, based on an image semantic segmentation network, groups layers into two stages to output common and independent features, respectively. The extracted common representation $F_{co}$ can be explicitly reused across frames for high efficiency. (b) Structure of the Feature Fusion Module (FFM), where the exemplar target frame is a non-key frame.
  • Figure 4: Illustration of the common feature mining process. A labeled frame $x_l$ serves alternately as the keyframe and as a non-key frame. This cyclic treatment facilitates implicit supervision of the deep features in an unlabeled neighboring frame $x_u$.
  • Figure 5: Illustration of the proposed consistency loss. We minimize the distance between $F^u_p$ and $F^l_p$, where $p\in M_{inter}$. On a moving object, this operation is akin to minimizing the distance between $F^{u}_{p1}$ and $F^{u}_{p3}$.
  • ...and 5 more figures
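
The consistency term described in the Figure 5 caption can be sketched as a masked feature distance. This is a hedged illustration only: the caption does not say how $M_{inter}$ is built or which distance is used, so the mask is passed in as a given input and the squared Euclidean distance is an assumption. The function name and the flat list-of-vectors layout are likewise hypothetical.

```python
from typing import List, Sequence


def consistency_loss(
    feat_u: Sequence[Sequence[float]],  # F^u: one feature vector per position (unlabeled frame)
    feat_l: Sequence[Sequence[float]],  # F^l: one feature vector per position (labeled frame)
    mask: Sequence[bool],               # True where the position p belongs to M_inter
) -> float:
    """Mean squared Euclidean distance between F^u_p and F^l_p over masked positions."""
    if not (len(feat_u) == len(feat_l) == len(mask)):
        raise ValueError("feat_u, feat_l, and mask must have the same length")

    total = 0.0
    count = 0
    for f_u, f_l, keep in zip(feat_u, feat_l, mask):
        if keep:
            # Assumed distance: squared L2 between the two feature vectors.
            total += sum((a - b) ** 2 for a, b in zip(f_u, f_l))
            count += 1

    # An empty mask contributes no loss.
    return total / count if count else 0.0
```

Positions outside the mask are ignored entirely, so only locations in $M_{inter}$ pull the features of the two frames toward each other.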