Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection

Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander

TL;DR

This work tackles domain adaptive object detection (DAOD) by decoupling target pseudo-label generation from the student and leveraging large vision foundation models (VFMs). It introduces DINO Teacher (DT), which has two components. First, a simple detector is trained on source data on top of a frozen DINOv2 backbone and used as an external labeller to generate target pseudo-labels. Second, a separate patch-level feature alignment mechanism pulls the student's source and target representations toward the DINO feature space. On the Cityscapes→BDD100k, Cityscapes→Foggy Cityscapes, and Cityscapes→ACDC transfers, DT achieves state-of-the-art performance, with notable improvements on rare classes and under adverse conditions. The results indicate that external VFMs can substantially improve pseudo-label quality and domain generalization without requiring extensive fine-tuning of the backbone, offering a practical boost for real-world DAOD systems.

Abstract

The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundational models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code available at https://github.com/TRAILab/DINO_Teacher
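
The two mechanisms contrasted in the abstract can be illustrated with a minimal sketch. The first function is the standard exponential moving average update that derives a Mean Teacher from its student. The second is one plausible form of a patch-level alignment loss, written here as the mean cosine distance between student patch features and frozen DINO patch features. The function names, the `decay` value, the flat-list weight format, and the cosine form of the loss are all assumptions made for illustration; they are not taken from the paper or its released code, and any projection layer is assumed to have been applied already.

```python
import math


def ema_update(teacher, student, decay=0.999):
    """Mean Teacher update: teacher <- decay * teacher + (1 - decay) * student.

    Weights are represented as {name: list of floats} for illustration only.
    """
    return {
        name: [decay * t + (1.0 - decay) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def patch_alignment_loss(student_patches, dino_patches):
    """Hypothetical alignment loss: mean (1 - cosine similarity) over patches.

    student_patches[i] and dino_patches[i] are features of the same image
    patch, already projected to the same dimension (an assumption).
    """
    if len(student_patches) != len(dino_patches):
        raise ValueError("patch counts must match")
    sims = [cosine_similarity(s, d)
            for s, d in zip(student_patches, dino_patches)]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    teacher = ema_update({"w": [1.0, 0.0]}, {"w": [0.0, 1.0]}, decay=0.9)
    print(teacher)
    print(patch_alignment_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

In the EMA case the teacher can only be as good as the student it averages, which is the coupling the abstract criticizes. In the alignment case the target features come from a frozen encoder, so the loss gives the student a fixed reference that does not depend on its own predictions.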

Paper Structure

This paper contains 37 sections, 4 equations, 5 figures, 14 tables, 1 algorithm.

Figures (5)

  • Figure 1: Cosine similarity of patch features from Cityscapes to BDD100k. We evaluate the similarity between the yellow star region in the Cityscapes image and all other regions in both images. Compared to the EMA Teacher, DINOv2 generates semantically stable features across domains, justifying the choice to use it for feature alignment and as a frozen backbone for our labeller.
  • Figure 2: Diagram of our proposed method. Offline Labeller Training: we add a detector head to a frozen DINOv2 encoder and train it with source images only. Offline Label Generation: we combine and freeze the labeller backbone and detector, and generate target pseudo-labels. Online Student Training: we train a student network using source ground truth boxes and target pseudo-labels, and align patch features to a frozen DINOv2 encoder. During inference, the alignment encoder and projection MLP are not used.
  • Figure 3: t-SNE of the backbone instance-level embeddings across domains. The "without alignment" and "with alignment" subfigures are taken from VGG16 features after 20k training iterations of supervised training on source only (Cityscapes), following the protocol defined in the experiments section. Without alignment, there is confusion between all similar classes. With alignment, overlap is reduced, particularly between persons and riders. Pretrained DINOv2 has well separated clusters.
  • Figure 4: Quality of generated pseudo-labels. Ratio of the number of high-confidence pseudo-labels to the total number of instances per class. The student model (SO and MT) is much weaker for the rare classes, and as Mean Teacher training progresses, the pseudo-label quality becomes worse.
  • Figure 5: Qualitative results on target domain. We compare Adaptive Teacher (left) to our DINO Teacher (right) on BDD (rows 1 and 2), Foggy Cityscapes (rows 3 and 4) and ACDC Night (rows 5 and 6). Green, Yellow, Orange and Red indicate true positives, low-confidence positives, false positives, and false negatives respectively. We use a threshold of 0.7 for true positives and false positives.
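
The per-class ratio described in the Figure 4 caption can be computed as in the sketch below. The function name, the `(label, score)` detection format, and the default threshold of 0.8 are hypothetical choices for illustration; the caption does not state which confidence threshold defines "high-confidence" for that figure.

```python
from collections import Counter


def high_confidence_ratio(detections, gt_counts, threshold=0.8):
    """Ratio of high-confidence pseudo-labels to ground-truth instances per class.

    detections: iterable of (class_name, confidence_score) pairs.
    gt_counts:  {class_name: number of ground-truth instances}.
    threshold:  hypothetical confidence cut-off for keeping a pseudo-label.
    """
    kept = Counter(label for label, score in detections if score >= threshold)
    # Classes with no ground-truth instances are skipped to avoid division by zero.
    return {cls: kept[cls] / n for cls, n in gt_counts.items() if n > 0}


if __name__ == "__main__":
    detections = [("car", 0.9), ("car", 0.5), ("train", 0.85), ("car", 0.95)]
    gt_counts = {"car": 4, "train": 2}
    print(high_confidence_ratio(detections, gt_counts))
```

This counts kept pseudo-labels without matching them to ground-truth boxes, so a ratio above 1 is possible when a labeller produces confident false positives. A low ratio on a rare class indicates that few of its instances receive a usable pseudo-label.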