We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline

Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Prithvijit Chattopadhyay, Judy Hoffman, Viraj Prabhu

TL;DR

This work benchmarks state-of-the-art Image-DAS methods against Video-DAS baselines on Video-DAS tasks, revealing that Image-DAS methods like HRDA and HRDA+MIC outperform specialized Video-DAS approaches across standard shifts. A central finding is that multi-resolution fusion drives the bulk of Image-DAS gains on video data, casting doubt on the added value of many Video-DAS techniques in current benchmarks. The authors also introduce UnifiedVideoDA, an open-source framework to enable unified benchmarking and cross-pollination between Image-DAS and Video-DAS research, and they provide extensive analyses of combining techniques and pseudo-label refinement. Overall, the results suggest that Image-DAS advances currently offer stronger, more consistent improvements for sim-to-real semantic segmentation than contemporary Video-DAS methods, while highlighting directions for future work in cross-benchmark methodologies and refinement strategies.

Abstract

There has been abundant work in unsupervised domain adaptation for semantic segmentation (DAS) seeking to adapt a model trained on images from a labeled source domain to an unlabeled target domain. While the vast majority of prior work has studied this as a frame-level Image-DAS problem, a few Video-DAS works have sought to additionally leverage the temporal signal present in adjacent frames. However, Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking. In this work, we address this gap. Surprisingly, we find that (1) even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods (HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS benchmarks (+14.5 mIoU on Viper$\rightarrow$CityscapesSeq, +19.0 mIoU on Synthia$\rightarrow$CityscapesSeq), and (2) naive combinations of Image-DAS and Video-DAS techniques only lead to marginal improvements across datasets. To avoid siloed progress between Image-DAS and Video-DAS, we open-source our codebase with support for a comprehensive set of Video-DAS and Image-DAS methods on a common benchmark. Code available at https://github.com/SimarKareer/UnifiedVideoDA

Paper Structure (23 sections, 8 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Overview. Recent domain adaptive video segmentation methods do not compare against state-of-the-art baselines for Image-DAS. We perform the first such cross-benchmarking, and find that even after controlling for data and model architecture, Image-DAS methods strongly outperform Video-DAS methods on the two key Video-DAS benchmarks.
  • Figure 2: Simplified Training Pipeline. Temporally separated frames are first augmented and then passed through a model $h_{\theta}$ to produce source and target predictions, which produce supervised and adaptation losses. In this standard self-training pipeline, Video-DAS methods typically add one or more of the following techniques: consistent mixup, ACCEL, pseudo-label refinement, and video discriminators. We further elaborate on each of these techniques in Figure 3.
  • Figure 3: Key Video Techniques. a) Consistent mix-up ensures that paired frames receive the same class mix-up, in this case "car" and "truck". The ACCEL architecture makes predictions $\hat{y}_t$ and $\hat{y}_{t+k}$, aligns them via $\texttt{prop}(y_{t+k}, o_{t\rightarrow t+k})$, then fuses them with a $1\times1$ convolution. b) Video discriminators learn a classifier to distinguish whether temporally stacked features belong to the source or target domain, in conjunction with a feature encoder that is updated adversarially to make the two indistinguishable. c) Pseudo-label refinement improves predictions by fusing $\hat{y}_t$ and $\hat{y}_{t+k}$ based on one of several criteria.
  • Figure 4: Each pseudo-label refinement strategy takes as input the current prediction $\hat{y}_t$, as well as a prediction for a future frame $\hat{y}_{t+k}$ warped via optical flow to the current frame, and merges them into a refined prediction $\hat{y}'_t$. a) Consistency discards predictions that are inconsistent across frames. b) Max confidence selects the more confident prediction between frames. c) Warped frame uses the warped prediction instead of the current prediction. d) Oracle performs consistency-based refinement, but with the ground truth label.
  • Figure 5: Temporal Consistency of Predictions (PL-PredConsis-IoU) with and without MRFusion. DeepLabV2 backbone trained on Viper$\rightarrow$CityscapesSeq.
  • ...and 1 more figure
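
The four pseudo-label refinement strategies described in the Figure 4 caption can be sketched per pixel as below. This is a minimal illustration under stated assumptions, not the UnifiedVideoDA implementation: the function name `refine_pseudo_labels`, the strategy strings, the flat-list representation of a frame, and the ignore index 255 are all hypothetical choices made for this example. The sketch also assumes the future-frame prediction has already been warped to the current frame via optical flow, and it reads "oracle" as keeping the current prediction only where it matches the ground truth.

```python
IGNORE = 255  # assumed "ignore" index for discarded pixels (hypothetical choice)


def refine_pseudo_labels(cur_labels, cur_conf, warped_labels, warped_conf,
                         strategy, gt_labels=None):
    """Merge the frame-t prediction with the warped frame-(t+k) prediction.

    All inputs are flat per-pixel lists of equal length. `warped_*` is assumed
    to be the future-frame prediction already warped to frame t.
    """
    if strategy == "oracle" and gt_labels is None:
        raise ValueError("the oracle strategy needs ground-truth labels")

    refined = []
    for i, (cur, warped) in enumerate(zip(cur_labels, warped_labels)):
        if strategy == "consistency":
            # Discard pixels where the two frames disagree.
            refined.append(cur if cur == warped else IGNORE)
        elif strategy == "max_confidence":
            # Keep whichever prediction is more confident (ties favor frame t).
            refined.append(cur if cur_conf[i] >= warped_conf[i] else warped)
        elif strategy == "warped_frame":
            # Use the warped prediction in place of the current one.
            refined.append(warped)
        elif strategy == "oracle":
            # Consistency check against the ground truth instead of the warp.
            refined.append(cur if cur == gt_labels[i] else IGNORE)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return refined


if __name__ == "__main__":
    cur, cur_conf = [1, 2, 3, 4], [0.9, 0.4, 0.8, 0.6]
    warped, warped_conf = [1, 5, 3, 7], [0.7, 0.8, 0.9, 0.5]
    for name in ("consistency", "max_confidence", "warped_frame"):
        print(name, refine_pseudo_labels(cur, cur_conf, warped, warped_conf, name))
```

In a self-training loop, pixels set to the ignore index would typically be excluded from the adaptation loss, so the consistency and oracle variants trade pseudo-label coverage for precision, while the other two keep every pixel labeled.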