GOOD: Towards Domain Generalized Orientated Object Detection

Qi Bi, Beichen Zhou, Jingjun Yi, Wei Ji, Haolan Zhan, Gui-Song Xia

TL;DR

The paper tackles the challenge of domain generalization for oriented object detection in aerial imagery, where unseen target domains exhibit substantial style variation that harms content representation and orientation accuracy. It introduces GOOD, a backbone-agnostic detector empowered by CLIP-driven style hallucination and two consistency modules: rotation-aware content consistency learning (RAC) and style consistency learning (SEC). RAC aligns horizontal and rotated region proposals across original and style-hallucinated views to stabilize orientation cues, while SEC enforces content invariance through Jensen-Shannon Divergence between category distributions across styles. Comprehensive cross-domain experiments across FAIR1M, DOTA variants, SODA, and HRSC demonstrate that GOOD achieves state-of-the-art generalization to unseen domains, with ablations confirming the effectiveness of each component. The work advances practical domain-generalized oriented detection by leveraging vision-language pretraining to enrich style diversity and by formalizing robust cross-domain evaluation protocols.

Abstract

Oriented object detection has developed rapidly in the past few years, but most existing methods assume that training and testing images follow the same statistical distribution, which is far from reality. In this paper, we propose the task of domain generalized oriented object detection, which explores how well oriented object detectors generalize to arbitrary unseen target domains. Learning domain generalized oriented object detectors is particularly challenging, as cross-domain style variation not only degrades the content representation but also leads to unreliable orientation predictions. To address these challenges, we propose a generalized oriented object detector (GOOD). After style hallucination driven by contrastive language-image pre-training (CLIP), GOOD applies two key components: rotation-aware content consistency learning (RAC) and style consistency learning (SEC). RAC allows the oriented object detector to learn a stable orientation representation from style-diversified samples. SEC further stabilizes the generalization ability of the content representation across different image styles. Extensive experiments on multiple cross-domain settings show the state-of-the-art performance of GOOD. Source code will be publicly available.
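
The TL;DR describes SEC as enforcing consistency through a Jensen-Shannon divergence between category distributions across styles. The following is a minimal sketch of that divergence for two class-probability vectors, not the authors' implementation: the function names, the example logits, and the restriction to two distributions are assumptions made for illustration, and the paper's loss may combine more views or apply different weighting.

```python
import math


def softmax(logits):
    """Turn raw class scores into a probability distribution."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical class logits for the same region proposal, taken from the
# original image and from its style-hallucinated counterpart.
original_probs = softmax([2.0, 0.5, -1.0])
stylized_probs = softmax([1.6, 0.9, -0.8])

# A small value means the two views agree on the category distribution.
consistency_loss = js_divergence(original_probs, stylized_probs)
```

The divergence is symmetric and bounded above by ln 2, so neither view is treated as the reference and the penalty stays finite even when the two predictions disagree completely.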

Paper Structure

This paper contains 45 sections, 11 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: (a) Domain generalized oriented object detection aims to learn an oriented object detector from only a source domain that generalizes well to (b) arbitrary unseen target domains.
  • Figure 2: Quantifying the styles from different aerial image domains. The per-sample style is quantified by two variables, namely, mean and standard deviation [huang2017arbitrary]. Mean demonstrates how the samples are spread out in the feature space, whereas standard deviation demonstrates how the per-sample feature varies. For each aerial image domain, the per-sample mean and standard deviation are computed and visualized. Styles from different domains show a dramatic discrepancy.
  • Figure 3: Two key challenges when an oriented object detector generalizes to unseen target domains. (a) degraded category prediction: multiple ships are not detected. In the feature space, samples from each category are not properly clustered due to cross-domain style variation; (b) imprecise oriented prediction: the rotation angle does not align with the object orientation, which is reflected in both image space and blue arrows in feature space.
  • Figure 4: Framework overview of the proposed GOOD. After CLIP-driven style hallucination (Sec. 3.2), two key components, namely, rotation-aware consistency learning (RAC, in Sec. 3.3) and style consistency learning (SEC, in Sec. 3.4), are involved. RAC implements content consistency on both horizontal and rotated regions of interest (HRoI and RRoI), as presented in Eq. (HRoICL) and Eq. (RRoICL), respectively. SEC implements style consistency on category-wise representations from both original and style-hallucinated images (in Eq. (JSDloss)).
  • Figure 5: t-SNE visualization of the cross-domain feature space. Blue, green, and orange denote the target domains SODA, FAIR1M, and HRSC, respectively. The more uniformly the samples from these domains are distributed, the better the feature generalization.
  • ...and 2 more figures
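
The caption of Figure 2 quantifies per-sample style by a mean and a standard deviation. A minimal sketch of that computation is given below, assuming the statistics are taken channel-wise over the spatial positions of a feature map, as is common for style statistics. The function name, the nested-list C x H x W layout, and the toy values are hypothetical, and the summary does not state which network layer or aggregation the authors used.

```python
import statistics


def style_statistics(feature_map):
    """Return one (mean, std) pair per channel of a C x H x W feature map.

    The feature map is a nested list: channels, then rows, then values.
    The standard deviation is the population form (divides by H * W).
    """
    stats = []
    for channel in feature_map:
        values = [v for row in channel for v in row]  # flatten H x W
        stats.append((statistics.fmean(values), statistics.pstdev(values)))
    return stats


# Toy feature map with 2 channels of size 2 x 2 (illustrative values only).
feature_map = [
    [[1.0, 3.0], [1.0, 3.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]

per_channel_style = style_statistics(feature_map)
```

Collecting these pairs for every image in a dataset and plotting them per domain would give a style comparison of the kind the caption describes.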