
Source-free Domain Adaptation for Video Object Detection Under Adverse Image Conditions

Xingguang Zhang, Chih-Hsien Chou

TL;DR

This work addresses adapting video object detectors to adverse, unlabeled target domains without access to source data. It introduces STAR-MT, a source-free domain adaptation method for one-stage VOD that alternates between a Temporal Refinement Stage (TRS) and a Spatial Refinement Stage (SRS) within a mean-teacher framework to progressively align temporal and spatial features. The TRS transfers knowledge through feature alignment guided by an exponential moving average (EMA) teacher and through frame masking, while the SRS updates the backbone using high-quality pseudo labels derived from the teacher's class scores, together with a certainty-aware loss. Experiments on synthetically degraded versions of ImageNetVOD show that STAR-MT consistently improves detection under noise, turbulence, and haze, approaching supervised fine-tuning, and offers a practical baseline for deploying robust VOD under unknown adverse conditions.
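The mean-teacher component mentioned above can be illustrated with a minimal sketch of an EMA teacher update. This is not the authors' implementation: the function name `ema_update`, the plain-float parameter dictionaries (standing in for network tensors), and the momentum values are all assumptions chosen for illustration.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return teacher parameters after one EMA step.

    Each teacher parameter moves toward the student parameter:
        new_teacher = momentum * teacher + (1 - momentum) * student
    `teacher` and `student` are hypothetical dicts mapping a parameter
    name to a float; a real detector would hold tensors instead.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Toy usage with an assumed momentum of 0.9 so the change is visible.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
updated = ema_update(teacher, student, momentum=0.9)
```

A momentum close to 1 makes the teacher change slowly, which is the usual reason a mean teacher produces more stable pseudo labels than the student it tracks.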

Abstract

When deploying pre-trained video object detectors in real-world scenarios, the domain gap between training and testing data caused by adverse image conditions often leads to performance degradation. Addressing this issue becomes particularly challenging when only the pre-trained model and degraded videos are available. Although various source-free domain adaptation (SFDA) methods have been proposed for single-frame object detectors, SFDA for video object detection (VOD) remains unexplored. Moreover, most unsupervised domain adaptation works for object detection rely on two-stage detectors, while SFDA for one-stage detectors, which are more vulnerable to fine-tuning, is not well addressed in the literature. In this paper, we propose Spatial-Temporal Alternate Refinement with Mean Teacher (STAR-MT), a simple yet effective SFDA method for VOD. Specifically, we aim to improve the performance of the one-stage VOD method, YOLOV, under adverse image conditions, including noise, air turbulence, and haze. Extensive experiments on the ImageNetVOD dataset and its degraded versions demonstrate that our method consistently improves video object detection performance in challenging imaging conditions, showcasing its potential for real-world applications.

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The scope of this work: we aim to adapt the video object detection model trained on clean image sequences to degraded image sequences under the condition that the data from the source domain and ground truth labels of the target domain are unavailable during the adaptation.
  • Figure 2: Overview of the proposed STAR-MT for source-free adaptive video object detection. The domain adaptive fine-tuning alternately operates in two stages: (a) Temporal Refinement Stage (TRS) and (b) Spatial Refinement Stage (SRS).
  • Figure 3: A snippet of the ImageNetVOD dataset and three forms of degradation. The original frames are taken from the testing video ILSVRC2015_test_00028000.mp4 at $t=32$.
  • Figure 4: Visual comparison before and after the SFDA by STAR-MT. All experiments are conducted with YOLOV-S.
  • Figure 5: Variation of the teacher model's AP50 and mean self-entropy $H$ during STAR-MT training of YOLOV-S and YOLOV-L. Both experiments are conducted on clean $\rightarrow$ haze. The values of $H$ that correspond to the best teacher model are marked with "+" in the figures.
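The mean self-entropy $H$ tracked in Figure 5 can be illustrated with a short sketch. The exact definition used in the paper is not given in this excerpt, so the code below assumes the common form: the Shannon entropy of each detection's normalized class-probability vector, averaged over all detections. The function names and the toy inputs are hypothetical.

```python
import math


def self_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector.

    Assumes `probs` is non-negative and sums to 1; zero entries
    contribute nothing, following the convention 0 * log(0) = 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_self_entropy(predictions):
    """Average self-entropy over a non-empty list of probability vectors."""
    if not predictions:
        raise ValueError("need at least one prediction")
    return sum(self_entropy(p) for p in predictions) / len(predictions)


# Toy usage: one confident and one maximally uncertain 4-class prediction.
confident = [1.0, 0.0, 0.0, 0.0]
uncertain = [0.25, 0.25, 0.25, 0.25]
h = mean_self_entropy([confident, uncertain])
```

Under this assumed definition a lower $H$ means the model's class predictions are more confident, and because the quantity needs no ground-truth labels it can be monitored on the unlabeled target domain.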