RT-DATR: Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Alignment

Feng Lv, Guoqing Li, Jin Li, Chunlong Xia

TL;DR

RT-DATR tackles unsupervised domain adaptation for real-time DETR-style detectors by augmenting RT-DETR with three adversarial feature-alignment modules: Local Object-level Feature Alignment (LOFA), Scene Semantic Feature Alignment (SSFA), and Instance Feature Alignment (IFA), plus a decoupled domain query and a decoder-layer consistency loss. Alignment is applied at the backbone, encoder, and decoder stages during training only, so inference latency is unchanged. The method improves cross-domain generalization across weather, scene, synthetic-to-real, and cross-camera tasks, reporting state-of-the-art results on benchmarks such as Cityscapes→Foggy Cityscapes, Cityscapes→BDD100K, Sim10K→Cityscapes, and KITTI→Cityscapes. The work suggests that targeted alignment with no added inference cost can substantially reduce domain gaps in transformer-based detectors, making real-time cross-domain object detection practical.

Abstract

Although domain-adaptive object detectors based on CNNs and transformers have made significant progress in cross-domain detection tasks, domain adaptation for real-time transformer-based detectors has not yet been explored, and directly applying existing domain adaptation algorithms has proven suboptimal. In this paper, we propose RT-DATR, a simple and efficient real-time domain adaptive detection transformer. Building on RT-DETR as our base detector, we first introduce a local object-level feature alignment module that significantly strengthens the domain invariance of the feature representation during object transfer. Additionally, we introduce a scene semantic feature alignment module designed to boost cross-domain detection performance by aligning scene semantic features. Finally, we introduce a domain query and decouple it from the object query to further align the instance feature distribution within the decoder layer, reducing the domain gap while maintaining discriminative ability. Experimental results on various cross-domain benchmarks demonstrate that our method outperforms current state-of-the-art approaches. Code is available at https://github.com/Jeremy-lf/RT-DATR.
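The abstract names adversarial feature alignment but does not spell out the training mechanics. A common way to realize such alignment is a domain classifier placed behind a gradient reversal step, and the minimal sketch below assumes that setup. It is not taken from the RT-DATR code: the function names, the linear classifier, the `lam` scaling factor, and the toy feature vector are all hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss_and_grads(feature, weights, bias, domain_label):
    """Binary cross-entropy of a linear domain classifier (hypothetical).

    domain_label is 0.0 for a source-domain feature and 1.0 for a
    target-domain feature. Returns the loss, its gradient with respect
    to the classifier weights, and its gradient with respect to the
    input feature.
    """
    logit = sum(w * x for w, x in zip(weights, feature)) + bias
    p = sigmoid(logit)
    eps = 1e-12  # guards against log(0)
    loss = -(domain_label * math.log(p + eps)
             + (1.0 - domain_label) * math.log(1.0 - p + eps))
    dlogit = p - domain_label  # derivative of BCE w.r.t. the logit
    grad_weights = [dlogit * x for x in feature]
    grad_feature = [dlogit * w for w in weights]
    return loss, grad_weights, grad_feature


def reverse_gradient(grad_feature, lam=1.0):
    """Gradient reversal: flip and scale the gradient sent to the feature
    extractor, so the extractor is pushed to confuse the classifier."""
    return [-lam * g for g in grad_feature]


# Toy example: one 3-dimensional feature from the target domain.
feature = [0.5, -1.0, 2.0]
weights = [0.1, 0.2, -0.3]
loss, grad_w, grad_f = domain_loss_and_grads(feature, weights, 0.0, domain_label=1.0)

# The classifier would be updated with grad_w (it learns to tell domains apart);
# the feature extractor would receive the reversed gradient instead of grad_f.
reversed_grad_f = reverse_gradient(grad_f, lam=0.5)
```

In this setup the classifier and the feature extractor optimize opposing objectives during training, and the classifier can be dropped at inference time, which is consistent with the claim that alignment adds no inference latency.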

Paper Structure

This paper contains 17 sections, 8 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Performance comparison of different benchmarks.
  • Figure 2: The architecture of RT-DATR. It consists of the base detector RT-DETR, along with three feature alignment modules: local object-level, scene semantic, and instance feature alignment modules. Object proposals are the object regions predicted by the model for object-level feature alignment.
  • Figure 3: Comparison of visualization results from different methods across cross-domain detection datasets: the BDD100K dataset at the top, the Cityscapes dataset in the middle, and the Foggy Cityscapes dataset at the bottom.
  • Figure 4: t-SNE feature visualization of the two domains for Cityscapes to Foggy Cityscapes. The blue and red points denote source and target features, respectively.