We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline

Simar Kareer; Vivek Vijaykumar; Harsh Maheshwari; Prithvijit Chattopadhyay; Judy Hoffman; Viraj Prabhu

TL;DR

This work benchmarks state-of-the-art Image-DAS methods against Video-DAS baselines on Video-DAS tasks, revealing that Image-DAS methods like HRDA and HRDA+MIC outperform specialized Video-DAS approaches across standard shifts. A central finding is that multi-resolution fusion drives the bulk of Image-DAS gains on video data, casting doubt on the added value of many Video-DAS techniques in current benchmarks. The authors also introduce UnifiedVideoDA, an open-source framework to enable unified benchmarking and cross-pollination between Image-DAS and Video-DAS research, and they provide extensive analyses of combining techniques and pseudo-label refinement. Overall, the results suggest that Image-DAS advances currently offer stronger, more consistent improvements for sim-to-real semantic segmentation than contemporary Video-DAS methods, while highlighting directions for future work in cross-benchmark methodologies and refinement strategies.

Abstract

There has been abundant work in unsupervised domain adaptation for semantic segmentation (DAS) seeking to adapt a model trained on images from a labeled source domain to an unlabeled target domain. While the vast majority of prior work has studied this as a frame-level Image-DAS problem, a few Video-DAS works have sought to additionally leverage the temporal signal present in adjacent frames. However, Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking. In this work, we address this gap. Surprisingly, we find that (1) even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods (HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS benchmarks (+14.5 mIoU on Viper$\rightarrow$CityscapesSeq, +19.0 mIoU on Synthia$\rightarrow$CityscapesSeq), and (2) naive combinations of Image-DAS and Video-DAS techniques only lead to marginal improvements across datasets. To avoid siloed progress between Image-DAS and Video-DAS, we open-source our codebase with support for a comprehensive set of Video-DAS and Image-DAS methods on a common benchmark. Code available at https://github.com/SimarKareer/UnifiedVideoDA

Paper Structure (23 sections, 8 equations, 6 figures, 12 tables)

Introduction
Related work
Image-level UDA for Semantic Segmentation (Image-DAS)
Video-level Unsupervised Domain Adaptation
Preliminaries
Video UDA for Semantic Segmentation (Video-DAS)
Overview of Video-DAS and Image-DAS
Overview of Video-DAS techniques
Experiments
Experimental Setup
How do Video-DAS methods compare to updated Image-DAS baselines?
Can we combine techniques from Image-DAS and Video-DAS to improve performance?
Why is it difficult to combine Image-DAS and Video-DAS?
Exploring pseudo-label refinement strategies
Discussion and Future Work
...and 8 more sections

Figures (6)

Figure 1: Overview. Recent domain adaptive video segmentation methods do not compare against state-of-the-art baselines for Image-DAS. We perform the first such cross-benchmarking, and find that even after controlling for data and model architecture, Image-DAS methods strongly outperform Video-DAS methods on the two key Video-DAS benchmarks.
Figure 2: Simplified Training Pipeline. Temporally separated frames are first augmented and then passed through a model $h_{\theta}$ to produce source and target predictions, which yield supervised and adaptation losses. In this standard self-training pipeline, Video-DAS methods typically add one or more of the following techniques: consistent mix-up, ACCEL, pseudo-label refinement, and video discriminators. We further elaborate on each of these techniques in Figure 3.
Figure 3: Key Video Techniques. a) Consistent mix-up ensures that paired frames receive the same class mix-up, in this case "car" and "truck". The ACCEL architecture makes predictions $\hat{y}_t$ and $\hat{y}_{t+k}$, aligns them via $\texttt{prop}(y_{t+k}, o_{t\rightarrow t+k})$, then fuses them with a $1\times1$ convolution. b) Video discriminators learn a classifier to distinguish whether temporally stacked features belong to the source or target domain, in conjunction with a feature encoder that is updated adversarially to make the two indistinguishable. c) Pseudo-label refinement improves predictions by fusing $\hat{y}_t$ and $\hat{y}_{t+k}$ based on one of several criteria.
Figure 4: Each pseudo-label refinement strategy takes as input the current prediction $\hat{y}_t$, as well as a prediction for a future frame $\hat{y}_{t+k}$ warped via optical flow to the current frame, and merges them together to make a refined prediction $\hat{y}'_t$. a) Consistency discards predictions that are inconsistent across frames. b) Max confidence selects the more confident prediction between frames. c) Warped frame uses the warped prediction instead of the current prediction. d) Oracle performs consistency-based refinement, but with the ground truth label.
Figure 5: Temporal Consistency of Predictions (PL-PredConsis-IoU) with and without MRFusion. DeepLabV2 backbone trained on Viper$\rightarrow$CityscapesSeq.
...and 1 more figure
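
The four pseudo-label refinement strategies described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the future-frame prediction has already been warped to the current frame via optical flow, represents each frame as a flat list of per-pixel class labels and confidences, and uses hypothetical function names and a hypothetical ignore index of 255 for discarded pixels.

```python
# Minimal sketch of the refinement strategies in the Figure 4 caption.
# Assumptions (not from the paper): flat per-pixel lists, pre-warped
# future-frame predictions, and IGNORE = 255 marking discarded pixels.

IGNORE = 255  # hypothetical ignore index


def refine_consistency(pred_t, warped_pred):
    """Keep a label only where both frames agree; discard the rest."""
    return [p if p == w else IGNORE for p, w in zip(pred_t, warped_pred)]


def refine_max_confidence(pred_t, conf_t, warped_pred, warped_conf):
    """Per pixel, take the label from the more confident frame
    (ties go to the current frame)."""
    return [
        p if cp >= cw else w
        for p, cp, w, cw in zip(pred_t, conf_t, warped_pred, warped_conf)
    ]


def refine_warped_frame(pred_t, warped_pred):
    """Use the warped future-frame prediction in place of the current one."""
    return list(warped_pred)


def refine_oracle(pred_t, ground_truth):
    """Consistency-style filtering against the ground truth label
    (an upper bound, since target labels are unavailable in practice)."""
    return [p if p == g else IGNORE for p, g in zip(pred_t, ground_truth)]


# Toy example with four pixels.
pred_t = [0, 1, 2, 2]
conf_t = [0.9, 0.4, 0.8, 0.6]
warped_pred = [0, 2, 2, 1]
warped_conf = [0.7, 0.6, 0.9, 0.5]
ground_truth = [0, 2, 2, 2]

print(refine_consistency(pred_t, warped_pred))
print(refine_max_confidence(pred_t, conf_t, warped_pred, warped_conf))
print(refine_warped_frame(pred_t, warped_pred))
print(refine_oracle(pred_t, ground_truth))
```

In a real training pipeline these operations would run on tensors of shape height by width, and the refined labels would replace the raw pseudo-labels in the adaptation loss; the list-based form here is only meant to make the per-pixel rules explicit.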
