GOOD: Towards Domain Generalized Orientated Object Detection
Qi Bi, Beichen Zhou, Jingjun Yi, Wei Ji, Haolan Zhan, Gui-Song Xia
TL;DR
The paper tackles domain generalization for oriented object detection in aerial imagery, where unseen target domains exhibit substantial style variation that degrades both content representation and orientation accuracy. It introduces GOOD, a backbone-agnostic detector built on CLIP-driven style hallucination and two consistency modules: rotation-aware content consistency learning (RAC) and style consistency learning (SEC). RAC aligns horizontal and rotated region proposals between the original and style-hallucinated views to stabilize orientation cues, while SEC enforces content invariance by minimizing the Jensen-Shannon divergence between the category distributions predicted under different styles. Cross-domain experiments on FAIR1M, DOTA variants, SODA, and HRSC show that GOOD achieves state-of-the-art generalization to unseen domains, and ablations confirm that each component contributes. The work advances practical domain-generalized oriented detection by using vision-language pretraining to enrich style diversity and by setting out cross-domain evaluation protocols for the task.
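To make the SEC idea concrete, here is a minimal sketch of a Jensen-Shannon divergence between the category distributions predicted for the same region under two styles. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `kl_divergence`, `js_divergence`, `style_consistency_loss`), the two-view setup, and the uniform mixture weights are all hypothetical, since the summary states only that a Jensen-Shannon divergence is taken between category distributions across styles.

```python
import math

def softmax(logits):
    """Turn raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def js_divergence(distributions):
    """Jensen-Shannon divergence of several distributions (uniform weights)."""
    n = len(distributions)
    k = len(distributions[0])
    # The mixture is the element-wise mean of all distributions.
    mixture = [sum(d[i] for d in distributions) / n for i in range(k)]
    return sum(kl_divergence(d, mixture) for d in distributions) / n

def style_consistency_loss(logits_original, logits_hallucinated):
    """Hypothetical SEC-style loss: JSD between the class distributions
    predicted for the original view and the style-hallucinated view."""
    p = softmax(logits_original)
    q = softmax(logits_hallucinated)
    return js_divergence([p, q])

if __name__ == "__main__":
    same = style_consistency_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
    shifted = style_consistency_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
    print(f"identical predictions: {same:.6f}")
    print(f"disagreeing predictions: {shifted:.6f}")
```

The loss is zero when both views yield the same category distribution and grows as they disagree, up to a maximum of ln 2 for two views, so minimizing it pushes the classifier toward style-invariant predictions.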
Abstract
Oriented object detection has developed rapidly in the past few years, but most existing methods assume that the training and testing images follow the same statistical distribution, which is far from reality. In this paper, we propose the task of domain generalized oriented object detection, which explores how well oriented object detectors generalize to arbitrary unseen target domains. Learning domain generalized oriented object detectors is particularly challenging, as cross-domain style variation not only degrades the content representation but also leads to unreliable orientation predictions. To address these challenges, we propose a generalized oriented object detector (GOOD). GOOD first hallucinates styles using contrastive language-image pre-training (CLIP) and then applies two key components, namely rotation-aware content consistency learning (RAC) and style consistency learning (SEC). RAC allows the oriented object detector to learn a stable orientation representation from style-diversified samples. SEC further stabilizes the generalization ability of the content representation across different image styles. Extensive experiments on multiple cross-domain settings show the state-of-the-art performance of GOOD. Source code will be publicly available.
