Context-Aware Aerial Object Detection: Leveraging Inter-Object and Background Relationships

Botao Ren, Botian Xu, Xue Yang, Yifan Pu, Jingyi Wang, Zhidong Deng

TL;DR

This work addresses a limitation of aerial object detection, where proposals are traditionally processed independently, by introducing a Transformer-based framework that jointly models inter-object relationships and background context. It combines RoI Tokens from two-stage detectors with CLIP Tokens generated from multi-scale patches and textual descriptions, enabling cross-modal fusion of spatial, geometric, and semantic information. Spatial-geometric relations are encoded as attention biases, attention is adaptively weighted by distance, scale, and density and restricted by IoU masking, and a self-supervised CLIP loss regularizes background semantics. Across DOTA-v1.0/v1.5/v2.0, DIOR-R, and HRSC2016, the method achieves state-of-the-art gains on mAP$_{50}$, particularly in densely populated scenes, and reduces scale inconsistencies, demonstrating the practical impact of relational reasoning in aerial imagery.

Abstract

In most modern object detection pipelines, the detection proposals are processed independently given the feature map. Therefore, they overlook the underlying relationships between objects and the surrounding background, which could have provided additional context for accurate detection. Because aerial imagery is almost orthographic, the spatial relations in image space closely align with those in the physical world, and inter-object and object-background relationships become particularly significant. To address this oversight, we propose a framework that leverages the strengths of Transformer-based models and Contrastive Language-Image Pre-training (CLIP) features to capture such relationships. Specifically, building on two-stage detectors, we treat Region of Interest (RoI) proposals as tokens, accompanied by CLIP Tokens obtained from multi-level image segments. These tokens are then passed through a Transformer encoder, where specific spatial and geometric relations are incorporated into the attention weights, which are adaptively modulated and regularized. Additionally, we introduce self-supervised constraints on CLIP Tokens to ensure consistency. Extensive experiments on three benchmark datasets demonstrate that our approach achieves consistent improvements, setting new state-of-the-art results with increases of 1.37 mAP$_{50}$ on DOTA-v1.0, 5.30 mAP$_{50}$ on DOTA-v1.5, 2.30 mAP$_{50}$ on DOTA-v2.0 and 3.23 mAP$_{50}$ on DIOR-R.
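
The idea of folding spatial and geometric relations into the attention weights can be illustrated with a small sketch. This is not the authors' implementation: the bias form (a normalized center distance plus a log scale ratio), the weights `w_dist` and `w_scale`, the IoU threshold, and the use of axis-aligned rather than oriented boxes are all assumptions made for illustration, and the function names are hypothetical. Each token (an RoI proposal or an image patch) is assumed to carry a box `(x1, y1, x2, y2)` with positive area.

```python
import math

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def relation_bias(a, b, w_dist=1.0, w_scale=1.0):
    """Hypothetical additive bias: penalize distant and differently sized pairs."""
    size_a = max(math.sqrt((a[2] - a[0]) * (a[3] - a[1])), 1e-6)
    size_b = max(math.sqrt((b[2] - b[0]) * (b[3] - b[1])), 1e-6)
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    # Center distance measured in units of the mean box size.
    dist = math.hypot(dx, dy) / (0.5 * (size_a + size_b))
    # Symmetric penalty on the size ratio.
    scale = abs(math.log(size_a / size_b))
    return -(w_dist * dist + w_scale * scale)

def biased_attention(queries, keys, values, boxes, iou_thresh=0.5):
    """Single-head attention with a spatial bias and an IoU mask (assumed form)."""
    n, d = len(boxes), len(queries[0])
    outputs, weights = [], []
    for i in range(n):
        logits = []
        for j in range(n):
            if i != j and iou(boxes[i], boxes[j]) > iou_thresh:
                # Assumption: heavily overlapping proposals do not attend to each other.
                logits.append(-math.inf)
            else:
                dot = sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
                logits.append(dot + relation_bias(boxes[i], boxes[j]))
        # Numerically stable softmax; the self-logit is never masked, so m is finite.
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        row = [e / z for e in exps]
        weights.append(row)
        outputs.append([sum(row[j] * values[j][c] for j in range(n))
                        for c in range(len(values[0]))])
    return outputs, weights
```

In this sketch the bias is added to the attention logits before the softmax, so nearby tokens of similar size receive more weight, while the mask removes pairs whose overlap exceeds `iou_thresh`. A learned or adaptively modulated bias, as the abstract describes, would replace the fixed `w_dist` and `w_scale`.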

Paper Structure (25 sections, 8 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Visualization of a motivating example. (a) Detections obtained by ReDet (Han et al., 2021) contain erroneous identifications: the upper left image shows a false detection of a ship on top of an airplane; the bottom left image shows an incorrect airplane detection with unrealistic size. (b) Improved results obtained by our method. The false positives are effectively addressed, highlighting the importance of considering inter-object relationships and background context in detection.
  • Figure 2: Overview of our model. The model utilizes a two-stage detection framework where features are converted into RoI tokens. Multi-scale patches generate CLIP tokens to capture background context, as shown in (a). Both tokens are processed by a Transformer encoder (b) with spatially aware attention, enhancing inter-object relationships based on distance and scale. Self-supervised constraints on CLIP tokens aid background classification, leading to refined detections trained with the supervised losses $L_{\text{reg}} + L_{\text{cls}}$ and the self-supervised loss $L_{\text{self}}$.
  • Figure 3: Qualitative comparison of our model and ReDet.
  • Figure 4: Count of outliers (in log-scale) for each category on the test dataset.
  • Figure 5: An example of a failure case: The ship detections are all incorrect, but they reinforce each other, leading to an increased number of false ship detections.