Source-Free Domain Adaptation for YOLO Object Detection
Simon Varailhon, Masih Aminbeidokhti, Marco Pedersoli, Eric Granger
TL;DR
This work introduces SF-YOLO, the first source-free domain adaptation method tailored to one-stage YOLO detectors for real-time object detection. It combines a learned Target Augmentation Module (TAM) with a mean-teacher framework and a novel Student Stabilisation Module (SSM) that stabilizes training in the absence of labeled target data, while preserving inference speed. The approach achieves performance competitive with or superior to Faster-RCNN-based SFDA methods, and even to some UDA methods that rely on source data, across Cityscapes, Foggy Cityscapes, Sim10k, and KITTI, while training stably and requiring minimal hyperparameter tuning. The paper also analyzes feature alignment strategies, finding that explicit feature alignment is unnecessary for source-free adaptation of YOLO, and that combining the exponential moving average (EMA) teacher update with SSM offers a practical, tuning-free route to reliable adaptation in real-world systems.
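As a compact illustration of the EMA+SSM pairing, the two weight updates can be written as below. The symbols are ours: θ_s and θ_t denote student and teacher weights, and α, γ ∈ [0, 1) are momentum coefficients. The first line is the standard mean-teacher EMA update. The second line is an assumed form of SSM (the student is pulled toward the teacher once per epoch), not a verbatim definition from the paper, and no coefficient values are implied.

```latex
\begin{aligned}
\theta_t &\leftarrow \alpha\,\theta_t + (1-\alpha)\,\theta_s && \text{after every student update (mean teacher)}\\
\theta_s &\leftarrow \gamma\,\theta_s + (1-\gamma)\,\theta_t && \text{at the end of each epoch (assumed SSM form)}
\end{aligned}
```

With α close to 1, the teacher changes slowly and averages out noisy student updates; under the assumed form, the second line feeds that smoothed state back into the student so that it cannot drift far from the teacher.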
Abstract
Source-free domain adaptation (SFDA) is a challenging problem in object detection, where a pre-trained source model is adapted to a new target domain without using any source domain data, for privacy and efficiency reasons. Most state-of-the-art SFDA methods for object detection have been proposed for Faster-RCNN, a detector known for its high computational complexity. This paper focuses on domain adaptation techniques for real-world vision systems, particularly for the YOLO family of single-shot detectors, which are known for their speed and practical applicability. Our proposed SFDA method, Source-Free YOLO (SF-YOLO), relies on a teacher-student framework in which the student receives images with a learned, target-domain-specific augmentation, allowing the model to be trained with only unlabeled target data and without requiring feature alignment. A challenge with self-training using a mean-teacher architecture in the absence of labels is the rapid decline in accuracy caused by noisy or drifting pseudo-labels. To address this issue, a teacher-to-student communication mechanism is introduced to help stabilize training and reduce the reliance on annotated target data for model selection. Despite its simplicity, our approach is competitive with state-of-the-art detectors on several challenging benchmark datasets, sometimes even outperforming methods that use source data for adaptation.
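The self-training loop described in the abstract can be sketched in plain Python under heavy simplifications. Weights are a toy dictionary of floats. `toy_student_step` is a hypothetical stand-in for one gradient step on the detection loss over a TAM-augmented target image. `filter_pseudo_labels` shows the common practice of keeping only confident teacher detections as pseudo-labels (the threshold is illustrative, not taken from the paper). The end-of-epoch student update assumes that the teacher-to-student mechanism (SSM) takes the form of an EMA from teacher to student. All function names, momentum values, and the learning rate are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Toy stand-in for a detector's weights: parameter name -> scalar value.
Params = Dict[str, float]


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    cls: int
    score: float  # teacher confidence in [0, 1]


def filter_pseudo_labels(dets: List[Detection], conf_thresh: float) -> List[Detection]:
    """Keep teacher detections confident enough to act as pseudo-labels."""
    return [d for d in dets if d.score >= conf_thresh]


def ema(target: Params, source: Params, momentum: float) -> Params:
    """Return momentum * target + (1 - momentum) * source, parameter by parameter."""
    return {k: momentum * target[k] + (1.0 - momentum) * source[k] for k in target}


def toy_student_step(student: Params, lr: float, goal: float) -> Params:
    """Hypothetical stand-in for one gradient step on the detection loss.

    Each weight moves a fraction `lr` of the way toward `goal`, mimicking a
    student fitting the teacher's pseudo-labels on an augmented image.
    """
    return {k: v - lr * (v - goal) for k, v in student.items()}


def self_train(source_model: Params, epochs: int, steps_per_epoch: int,
               teacher_momentum: float = 0.99, ssm_momentum: float = 0.5,
               lr: float = 0.1, goal: float = 1.0,
               use_ssm: bool = True) -> Tuple[Params, Params]:
    """Toy mean-teacher loop; all coefficients are illustrative, not the paper's."""
    teacher = dict(source_model)  # both models start from the source weights
    student = dict(source_model)
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            # Student update on an augmented target image (pseudo-labelling omitted).
            student = toy_student_step(student, lr, goal)
            # Mean-teacher update: the teacher slowly tracks the student.
            teacher = ema(teacher, student, teacher_momentum)
        if use_ssm:
            # Assumed SSM form: blend the smoother teacher back into the student.
            student = ema(student, teacher, ssm_momentum)
    return teacher, student


if __name__ == "__main__":
    t, s = self_train({"w": 0.0}, epochs=3, steps_per_epoch=10)
    t0, s0 = self_train({"w": 0.0}, epochs=3, steps_per_epoch=10, use_ssm=False)
    print(f"with SSM:    teacher={t['w']:.3f} student={s['w']:.3f}")
    print(f"without SSM: teacher={t0['w']:.3f} student={s0['w']:.3f}")
```

Running `self_train` with and without `use_ssm` isolates the stabilization step. At every epoch boundary, the assumed SSM update pulls the student's weights back toward the slower-moving teacher. In this toy, the stabilized student therefore ends below the unstabilized one, because the teacher lags behind it. In a real detector, `ema` would be applied tensor by tensor over the network's state.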
