CLDA-YOLO: Visual Contrastive Learning Based Domain Adaptive YOLO Detector

Tianheng Qiu, Ka Lung Law, Guanghua Pan, Jufei Wang, Xin Gao, Xuan Huang, Hu Wei

TL;DR

This paper addresses the challenge of unsupervised domain adaptation for single-stage object detectors, focusing on YOLO under domain shifts. It introduces CLDA-YOLO, a teacher–student framework augmented with uncertainty-aware pseudo-labeling, dynamic data augmentation, and a multi-stage visual contrastive learning strategy that aligns backbone and head features across domains. The approach achieves state-of-the-art or competitive results across multiple domain-shift benchmarks, notably outperforming prior DAOD methods on Cityscapes→Foggy Cityscapes, and demonstrates effective component-wise gains via ablations. The method provides a fast, scalable solution for cross-domain detection with practical implications for real-world deployments of YOLO in varied environments.

Abstract

Unsupervised domain adaptive (UDA) algorithms can markedly enhance the performance of object detectors under domain shift, thereby reducing the need for extensive labeling and retraining. Current domain adaptive object detection algorithms primarily cater to two-stage detectors and tend to offer minimal improvements when directly applied to single-stage detectors such as YOLO. To let the YOLO detector benefit from UDA, we build a comprehensive domain adaptive architecture for it using a teacher-student cooperative system. In this process, we propose uncertainty learning to cope with highly uncertain pseudo-labels generated by the teacher model, and leverage dynamic data augmentation to asymptotically adapt the teacher-student system to the environment. To address the inability of single-stage object detectors to align at multiple stages, we utilize a unified visual contrastive learning paradigm that aligns instances at the backbone and the head respectively, which steadily improves the robustness of the detector in cross-domain tasks. In summary, we present an unsupervised domain adaptive YOLO detector based on visual contrastive learning (CLDA-YOLO), which achieves highly competitive results across multiple domain adaptive datasets without any reduction in inference speed.
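
To make the idea of contrastive instance alignment concrete, the following is a minimal sketch of a category-aware, InfoNCE-style loss for a single instance feature. It is not the paper's loss: the abstract does not give the formulation, so the cosine similarity, the `temperature` value, the rule that same-category entries count as positives, and the names `cosine`, `contrastive_loss`, and `bank` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def contrastive_loss(anchor, anchor_cat, bank, temperature=0.1):
    """InfoNCE-style loss for one instance feature (illustrative only).

    bank is a list of (feature, category) pairs, e.g. instances from the
    other domain. Entries sharing the anchor's category are treated as
    positives and all others as negatives (an assumption of this sketch).
    Returns 0.0 when the bank holds no positive for this anchor.
    """
    logits = [cosine(anchor, feat) / temperature for feat, _ in bank]
    positives = [l for l, (_, cat) in zip(logits, bank) if cat == anchor_cat]
    if not positives:
        return 0.0
    # Numerically stable log-sum-exp over all bank entries.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    # Mean negative log-probability assigned to the positives.
    return sum(log_denom - l for l in positives) / len(positives)


# Hypothetical 2-D features: the loss is small when the same-category
# entry is close to the anchor and large when it is not.
aligned = [([1.0, 0.0], "car"), ([0.0, 1.0], "person")]
misaligned = [([0.0, 1.0], "car"), ([1.0, 0.0], "person")]
loss_aligned = contrastive_loss([1.0, 0.0], "car", aligned)
loss_misaligned = contrastive_loss([1.0, 0.0], "car", misaligned)
```

Minimizing a loss of this shape pulls same-category instance features together across domains and pushes different categories apart, which is the general behavior the abstract attributes to alignment at the backbone and the head.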

Paper Structure

This paper contains 11 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of mAP@.5 results under two experiment settings; our proposed method obtains the best performance.
  • Figure 2: The overall architecture of the proposed CLDA-YOLO. Following the teacher-student cooperative learning architecture, we build a domain adaptive architecture for the YOLO detector. The teacher model generates pseudo-labels used to compute the distillation loss and the uncertainty loss. The student model computes the source-domain supervised loss and the distillation loss, plus a contrastive alignment loss that we add so that the detector performs a coherent alignment at the backbone and the head, respectively.
  • Figure 3: Simple schema of our Contrastive Learning-based Alignment. Each box is described by a tuple of the form (features, confidence, category). The queue update is executed after the whole batch has been computed.
  • Figure 4: Visualization of detection results on images of normal weather, a rainy day, and a foggy day, from left to right. The first row shows the Source-Only model's detections and the second row shows CLDA-YOLO's.
  • Figure 5: Feature visualization of Cityscapes $\rightarrow$ Foggy Cityscapes by t-SNE, generated from each detector head. Categories and domains are distinguished by marker and color, respectively. Zoom in for a more detailed view.
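
The queue behavior described in the Figure 3 caption, where each box is a (features, confidence, category) tuple and the queue is updated only after the whole batch has been computed, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the class name `InstanceQueue`, the methods `stage` and `commit`, the fixed `max_size`, and the `min_confidence` filter are all hypothetical choices that the caption does not specify.

```python
from collections import deque


class InstanceQueue:
    """Fixed-size store of (features, confidence, category) tuples.

    New entries are staged while a batch is processed and only become
    visible in the queue when commit() is called at the end of the batch.
    """

    def __init__(self, max_size=4):
        # deque(maxlen=...) silently drops the oldest entries when full.
        self.entries = deque(maxlen=max_size)
        self._pending = []

    def stage(self, features, confidence, category):
        """Record one box during the batch without touching the queue."""
        self._pending.append((features, confidence, category))

    def commit(self, min_confidence=0.5):
        """Push staged boxes into the queue after the batch is done.

        Dropping low-confidence boxes is an assumption of this sketch.
        Returns the number of boxes that were kept.
        """
        kept = [e for e in self._pending if e[1] >= min_confidence]
        self.entries.extend(kept)
        self._pending.clear()
        return len(kept)


# Hypothetical batch of three boxes with 2-D features.
queue = InstanceQueue(max_size=2)
queue.stage([0.1, 0.9], 0.95, "car")
queue.stage([0.8, 0.2], 0.30, "person")  # below the confidence threshold
queue.stage([0.7, 0.3], 0.80, "person")
size_before_commit = len(queue.entries)  # the queue is unchanged mid-batch
kept = queue.commit()
```

Deferring the update means every box in a batch is compared against the same queue contents, so no instance is contrasted against features staged from its own batch.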