MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion

Angel Villar-Corrales, Moritz Austermann, Sven Behnke

TL;DR

This work tackles temporally coherent video semantic segmentation for moving cameras. It introduces MCDS-VSS, a structured recurrent filter that explicitly models scene geometry (depth), ego-motion, and dynamic object motion (via a residual flow), and fuses the projected previous features with current frame information in a prediction-fusion manner. Through a four-stage self-supervised training regime, the model learns interpretable representations and achieves superior temporal consistency while maintaining competitive per-frame segmentation accuracy on Cityscapes, outperforming several baselines and closely matching state-of-the-art VSS methods. The approach demonstrates the value of incorporating domain-specific inductive biases for dynamic scenes, and provides depth and motion representations that enhance robustness and interpretability in automotive vision tasks.

Abstract

Autonomous systems, such as self-driving cars, rely on reliable semantic environment perception for decision making. Despite great advances in video semantic segmentation, existing approaches ignore important inductive biases and lack structured and interpretable internal representations. In this work, we propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and ego-motion of the camera, while also estimating the motion of external objects. Our model leverages these representations to improve the temporal consistency of semantic segmentation without sacrificing segmentation accuracy. MCDS-VSS follows a prediction-fusion approach in which scene geometry and camera motion are first used to compensate for ego-motion, then residual flow is used to compensate motion of dynamic objects, and finally the predicted scene features are fused with the current features to obtain a temporally consistent scene segmentation. Our model parses automotive scenes into multiple decoupled interpretable representations such as scene geometry, ego-motion, and object motion. Quantitative evaluation shows that MCDS-VSS achieves superior temporal consistency on video sequences while retaining competitive segmentation performance.
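
The prediction-fusion steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `backward_warp` and `filter_step` are hypothetical, the two flows and the fusion mask are passed in by hand rather than predicted by networks, and nearest-neighbour sampling stands in for whatever interpolation the model uses. The mask convention is also an assumption: a value of 1 keeps the predicted (filtered) features and a value of 0 keeps the current observation.

```python
import numpy as np


def backward_warp(feat, flow):
    """Warp feat (C, H, W) with a backward flow (2, H, W).

    flow[:, y, x] = (dx, dy) points from pixel (x, y) in the current frame
    to its source location in the previous frame. Sampling is
    nearest-neighbour, and out-of-bounds sources are clamped to the border.
    """
    _, h, w = feat.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return feat[:, src_y, src_x]


def filter_step(prev_state, ego_flow, residual_flow, obs_feat, mask):
    """One prediction-fusion step (illustrative only).

    1. Compensate ego-motion by warping the previous state.
    2. Compensate object motion with the residual flow.
    3. Fuse the predicted state with the current observation.
    """
    ego_compensated = backward_warp(prev_state, ego_flow)
    predicted = backward_warp(ego_compensated, residual_flow)
    # Assumed convention: mask == 1 trusts the prediction, mask == 0 the observation.
    return mask * predicted + (1.0 - mask) * obs_feat


if __name__ == "__main__":
    prev = np.arange(12, dtype=float).reshape(1, 3, 4)
    shift_right = np.zeros((2, 3, 4))
    shift_right[0] = -1.0  # each pixel reads from its left neighbour
    no_motion = np.zeros((2, 3, 4))
    obs = np.full((1, 3, 4), 100.0)
    fused = filter_step(prev, shift_right, no_motion, obs, np.full((1, 3, 4), 0.5))
    print(fused[0])
```

Splitting the warp into an ego-motion step and a residual step mirrors the decoupling described above: the first warp accounts for static scene content, so the residual flow only needs to describe independently moving objects.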

Paper Structure (23 sections, 8 equations, 18 figures, 10 tables)

Figures (18)

  • Figure 1: MCDS-VSS structured filter. Scene depth $\mathbf{d}_{t-1}$, ego-motion $\textrm{C}_{{t-1}}^{\,{t}}$, and object-motion $\textrm{F}_{t-1}^{\,t}$ are used to project scene features $\mathbf{s}_{t-1}$ to the current time $t$, where they are fused with current image features $\mathbf{h}_{t}$ to obtain a temporally consistent semantic segmentation $\mathbf{\hat{y}}_{t}$.
  • Figure 2: Learning geometry and motion. a) We learn the scene depth $\mathbf{d}_{t-1}$ and ego-motion $\textrm{C}_{{t-1}}^{\,{t}}$ in a self-supervised manner given two video frames by enforcing a photometric loss $\mathcal{L}_\textrm{Photo}$ between the ego-warped $\mathbf{\hat{x}}^\textrm{ego}_{t}$ and target frames $\mathbf{x}_{t}$, as well as a depth regularization $\mathcal{L}_\textrm{Reg}$. b) Given an ego-warped image, we train a residual flow decoder to predict the residual optical flow $\hat{\textrm{F}}_{{t-1}}^{\,{t}}$ that parameterizes the dynamics of moving objects in the scene by distilling a pretrained RAFT model.
  • Figure 3: Qualitative evaluation on a validation sequence of five frames. a) Input frames, b) HRNetV2, c) MCDS-VSS (ours), d) Estimated scene depth, e) Estimated residual flow. We highlight areas of the segmentation masks where MCDS-VSS obtains visibly more accurate and temporally consistent segmentations, such as the traffic signs or the bus, which HRNetV2 mislabels as truck.
  • Figure 3: Comparison of various filter designs. We highlight the difference to the baseline.
  • Figure 4: Video segmentation for each stage in MCDS-VSS. a) Input images, b) segmentation after ego-motion compensation, c) segmentation after object motion compensation, d) segmentation after feature fusion, e) feature fusion update mask. In the mask, lighter colors indicate that filter information is used, whereas darker colors correspond to observations.
  • ...and 13 more figures
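
As a rough illustration of how scene depth and ego-motion together determine where static pixels move (Figures 1 and 2), the sketch below computes the pixel displacement induced by camera motion alone under a standard pinhole camera model. It is a sketch under stated assumptions, not the paper's formulation: the function name `ego_flow`, the intrinsics matrix `K`, and the split of the camera motion into a rotation `R` and translation `t` are all introduced here for illustration, and the result is a forward displacement from frame $t-1$ to frame $t$.

```python
import numpy as np


def ego_flow(depth, K, R, t):
    """Pixel displacement caused by camera motion alone (illustrative).

    depth: (H, W) depth of each pixel in frame t-1.
    K:     (3, 3) pinhole intrinsics (assumed known).
    R, t:  rotation (3, 3) and translation (3,) mapping points from the
           camera frame at t-1 to the camera frame at t.
    Returns a (2, H, W) array holding (du, dv) per pixel.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    # Back-project pixels to 3D points using their depth.
    points = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Move the points into the camera frame at time t.
    points_new = R @ points + t.reshape(3, 1)
    # Project back onto the image plane.
    proj = K @ points_new
    uv_new = proj[:2] / proj[2:3]
    return (uv_new - pix[:2]).reshape(2, h, w)


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 2.0],
                  [0.0, 100.0, 1.0],
                  [0.0, 0.0, 1.0]])
    depth = np.full((3, 4), 10.0)
    flow = ego_flow(depth, K, np.eye(3), np.array([0.5, 0.0, 0.0]))
    print(flow[0])  # horizontal displacement per pixel
```

For a pure sideways translation the displacement is inversely proportional to depth, so nearby surfaces shift more than distant ones. Any motion left unexplained by this rigid displacement is what a residual flow would have to account for.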