Source-Free Domain Adaptation for YOLO Object Detection

Simon Varailhon, Masih Aminbeidokhti, Marco Pedersoli, Eric Granger

TL;DR

This work introduces SF-YOLO, the first source-free domain adaptation method tailored to one-stage YOLO detectors for real-time object detection. It combines a learned Target Augmentation Module (TAM) with a mean-teacher framework and a novel Student Stabilisation Module (SSM) to stabilize training in the absence of labeled target data, while preserving inference speed. The approach achieves competitive or superior performance relative to Faster-RCNN–based SFDA methods and even some source-data–dependent UDA methods across Cityscapes, Foggy Cityscapes, Sim10k, and KITTI, with robust stability and minimal hyperparameter tuning. The paper also analyzes feature alignment strategies, finding that explicit alignment is unnecessary in SFDA for YOLO and that EMA+SSM provides a practical, tuning-free pathway to reliable adaptation in real-world systems.

Abstract

Source-free domain adaptation (SFDA) is a challenging problem in object detection, where a pre-trained source model is adapted to a new target domain without using any source-domain data, for privacy and efficiency reasons. Most state-of-the-art SFDA methods for object detection have been proposed for Faster-RCNN, a detector that is known to have high computational complexity. This paper focuses on domain adaptation techniques for real-world vision systems, particularly for the YOLO family of single-shot detectors, known for their fast baselines and practical applications. Our proposed SFDA method, Source-Free YOLO (SF-YOLO), relies on a teacher-student framework in which the student receives images with a learned, target-domain-specific augmentation, allowing the model to be trained with only unlabeled target data and without requiring feature alignment. A challenge with self-training using a mean-teacher architecture in the absence of labels is the rapid decline of accuracy due to noisy or drifting pseudo-labels. To address this issue, a teacher-to-student communication mechanism is introduced to help stabilize the training and reduce the reliance on annotated target data for model selection. Despite its simplicity, our approach is competitive with state-of-the-art detectors on several challenging benchmark datasets, sometimes even outperforming methods that use source data for adaptation.
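
The two weight-exchange steps described above (student to teacher through an exponential moving average, and teacher back to student through the stabilisation mechanism) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `ema_update`, `ssm_update`, and `adapt`, the defaults `alpha=0.999` and `gamma=1.0`, and the interpolation form of the teacher-to-student step are all assumptions made for illustration. Weights are modeled as plain lists of floats rather than real YOLO parameters, and `student_step` is a hypothetical callback standing in for one backpropagation step on pseudo-labeled, augmented target images.

```python
from typing import Callable, Dict, List

# Toy stand-in for a detector's weights: parameter name -> list of floats.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, alpha: float = 0.999) -> None:
    """Mean-teacher step: teacher <- alpha * teacher + (1 - alpha) * student."""
    for name, s_vals in student.items():
        teacher[name] = [alpha * t + (1.0 - alpha) * s
                         for t, s in zip(teacher[name], s_vals)]


def ssm_update(student: Params, teacher: Params, gamma: float = 1.0) -> None:
    """Teacher-to-student step (assumed form, not taken from the paper):
    student <- gamma * teacher + (1 - gamma) * student.
    With gamma = 1.0 the student simply receives a copy of the teacher weights.
    """
    for name, t_vals in teacher.items():
        student[name] = [gamma * t + (1.0 - gamma) * s
                         for t, s in zip(t_vals, student[name])]


def adapt(student: Params, teacher: Params, num_epochs: int,
          batches_per_epoch: int, student_step: Callable[[Params], None],
          alpha: float = 0.999, gamma: float = 1.0) -> None:
    """Schedule sketch: EMA after every batch, teacher-to-student once per epoch."""
    for _ in range(num_epochs):
        for _ in range(batches_per_epoch):
            student_step(student)                # hypothetical gradient step
            ema_update(teacher, student, alpha)  # high-frequency update
        ssm_update(student, teacher, gamma)      # low-frequency stabilisation
```

In this sketch the teacher changes slowly because `alpha` is close to 1, while the once-per-epoch reset pulls the student back toward the smoother teacher weights, which is one plausible way to limit drift caused by noisy pseudo-labels.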

Paper Structure

This paper contains 16 sections, 10 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: The proposed SF-YOLO training architecture. First, TAM is trained using all the target images from the training set. The learned augmented images are then used as data augmentation for the student, while the teacher receives unmodified target images. The student detector learns through backpropagation, as per the YOLO loss equation (eq:yolo_loss), and then updates the teacher detector using EMA after each batch. Finally, at a lower frequency, once every epoch, the teacher updates the student with SSM to stabilize the training.
  • Figure 1: Selected samples from the four datasets used in our experiments.
  • Figure 2: The training curves of the target augmented mean-teacher with and without SSM using YOLOv5l on the C2F scenario across different learning rates. SSM prevents MT from quick deterioration and reaches better final performance.
  • Figure 2: The training curves of the target augmented mean-teacher with and without SSM using YOLOv5s on the C2F scenario across different learning rates. SSM prevents MT from quick deterioration and reaches better final performance.
  • Figure 3: Example target-domain detections for the C2F scenario. Each color represents a class. Our approach qualitatively exhibits comparable performance to CAST-YOLO [liu2023cast] (see the results table, tab:foggy), even though it did not use labeled source data during the adaptation phase.
  • ...and 7 more figures