Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection
Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander
TL;DR
This work tackles domain adaptive object detection (DAOD) by decoupling target pseudo-label generation from the learner and leveraging large vision foundation models (VFMs). It introduces DINO Teacher (DT), which trains a simple detector on source data with a frozen DINOv2 backbone and uses it as an external labeller to generate target pseudo-labels. A separate patch-level feature alignment mechanism pulls the student's source and target representations toward the DINO feature space. Across the Cityscapes→BDD100k, Cityscapes→Foggy Cityscapes, and Cityscapes→ACDC transfers, DT achieves state-of-the-art performance, with notable improvements on rare classes and under adverse conditions. The approach demonstrates that external VFMs can substantially improve pseudo-label quality and domain generalization without requiring extensive fine-tuning of the backbone, offering a practical boost for real-world DAOD systems.
Abstract
The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, derived directly as an exponential moving average of the student model, generates labels on the target domain that are then used to improve both models in a positive feedback loop. This couples learning and label generation on the target domain, and other recent works also use the generated labels to add further domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by a large pretrained network that has been exposed to much more data. Vision foundation models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only, using a large frozen DINOv2 backbone, and show that it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code is available at https://github.com/TRAILab/DINO_Teacher
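
The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the decay value, the use of plain Python lists and scalars in place of tensors, and in particular the choice of cosine distance as the alignment loss are all assumptions made for illustration. The abstract states only that the teacher is an exponential moving average of the student and that student patch features are aligned with those of a DINO encoder.

```python
import math


def ema_update(teacher, student, decay=0.999):
    """Mean Teacher update: teacher <- decay * teacher + (1 - decay) * student.

    `teacher` and `student` map parameter names to scalar values here;
    a real detector would hold tensors. The decay value is an assumption.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero norms


def patch_alignment_loss(student_patches, dino_patches):
    """Mean cosine distance between corresponding patch features.

    Assumed loss form: the abstract does not specify the alignment
    objective. `dino_patches` stands in for features from a frozen
    DINO encoder, so only the student side would receive gradients.
    """
    if len(student_patches) != len(dino_patches):
        raise ValueError("patch counts must match")
    distances = [
        1.0 - cosine_similarity(s, d)
        for s, d in zip(student_patches, dino_patches)
    ]
    return sum(distances) / len(distances)
```

In this sketch the same alignment loss would be applied to both source and target images, which is what pulls the two domains toward a shared representation. The pseudo-labels for target images would come from the separate DINOv2-based labeller rather than from the EMA teacher alone.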
