Context-Aware Aerial Object Detection: Leveraging Inter-Object and Background Relationships
Botao Ren, Botian Xu, Xue Yang, Yifan Pu, Jingyi Wang, Zhidong Deng
TL;DR
This work addresses a limitation of aerial object detection, where proposals are traditionally processed independently, by introducing a Transformer-based framework that jointly models inter-object relationships and background context. It combines RoI Tokens from two-stage detectors with CLIP Tokens generated from multi-scale patches and textual descriptions, enabling cross-modal fusion of spatial, geometric, and semantic information. Spatial-geometric relations are encoded as attention biases, and attention is adaptively weighted by distance, scale, and density and masked by IoU, with a self-supervised CLIP loss regularizing background semantics. Across DOTA-v1.0/v1.5/v2.0, DIOR-R, and HRSC2016, the method achieves state-of-the-art mAP$_{50}$, with the largest gains in densely populated scenes, and reduces scale inconsistencies, demonstrating the practical value of relational reasoning in aerial imagery.
Abstract
In most modern object detection pipelines, detection proposals are processed independently given the feature map. As a result, these pipelines overlook the underlying relationships between objects and the surrounding background, which could provide additional context for accurate detection. Because aerial imagery is almost orthographic, spatial relations in image space closely align with those in the physical world, and inter-object and object-background relationships become particularly significant. To address this oversight, we propose a framework that leverages the strengths of Transformer-based models and Contrastive Language-Image Pre-training (CLIP) features to capture such relationships. Specifically, building on two-stage detectors, we treat Region of Interest (RoI) proposals as tokens, accompanied by CLIP Tokens obtained from multi-level image segments. These tokens are then passed through a Transformer encoder, where specific spatial and geometric relations are incorporated into the attention weights, which are adaptively modulated and regularized. Additionally, we introduce self-supervised constraints on CLIP Tokens to ensure consistency. Extensive experiments on three benchmark datasets demonstrate that our approach achieves consistent improvements, setting new state-of-the-art results with increases of 1.37 mAP$_{50}$ on DOTA-v1.0, 5.30 mAP$_{50}$ on DOTA-v1.5, 2.30 mAP$_{50}$ on DOTA-v2.0, and 3.23 mAP$_{50}$ on DIOR-R.
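To make the idea of incorporating spatial relations into attention weights concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a single attention head with identity query/key/value projections, and it uses a negative size-normalized centre distance as a stand-in for the spatial-geometric relations; the exact relation encoding, the adaptive modulation, and the CLIP Tokens are not reproduced here. The names `distance_bias` and `biased_self_attention` are hypothetical.

```python
import math

def distance_bias(boxes, scale=1.0):
    """Additive attention bias from box geometry (assumed form, for illustration).

    boxes: list of (cx, cy, w, h). The bias from query i to key j is the
    negative centre distance, normalised by the square root of box i's area.
    """
    n = len(boxes)
    bias = [[0.0] * n for _ in range(n)]
    for i, (xi, yi, wi, hi) in enumerate(boxes):
        norm = math.sqrt(wi * hi)
        for j, (xj, yj, _, _) in enumerate(boxes):
            bias[i][j] = -scale * math.hypot(xi - xj, yi - yj) / norm
    return bias

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_self_attention(tokens, bias):
    """Single-head self-attention with an additive bias on the logits.

    tokens: list of feature vectors (one per RoI proposal).
    Identity projections are assumed to keep the sketch short.
    """
    d = len(tokens[0])
    weights, outputs = [], []
    for i, q in enumerate(tokens):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + bias[i][j]
            for j, k in enumerate(tokens)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights

# Three proposals with identical features: two close together, one far away.
boxes = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0), (100.0, 0.0, 10.0, 10.0)]
tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
bias = distance_bias(boxes)
outputs, weights = biased_self_attention(tokens, bias)
```

Because the three tokens carry identical features, any difference in the attention weights comes from the geometric bias alone: the first proposal attends more strongly to its near neighbour than to the distant one. In a full model, background tokens would be appended to the same sequence with biases derived from their patch locations.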
