OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images

Jiaqi Zhao, Zeyu Ding, Yong Zhou, Hancheng Zhu, Wen-Liang Du, Rui Yao, Abdulmotaleb El Saddik

TL;DR

OrientedFormer tackles end-to-end oriented object detection in remote sensing images with three dedicated components: Gaussian positional encoding, which jointly encodes angle, position, and size; Wasserstein self-attention, which injects geometric relations into self-attention; and oriented cross-attention, which aligns values with positional queries by rotating the sampling points. The approach enables one-to-one label assignment and end-to-end training while improving detection accuracy across six benchmarks, achieving state-of-the-art results on several datasets and converging faster than prior DETR-like methods. Compared with previous end-to-end detectors, OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0, respectively, while reducing the training schedule from 3× to 1×. These findings support the viability of transformer-based oriented object detectors across diverse remote sensing scenarios, with potential impact on both research and practical detection pipelines.

Abstract

Oriented object detection in remote sensing images is a challenging task because objects appear in arbitrary orientations. Recently, end-to-end transformer-based methods have achieved success by eliminating the post-processing operators that traditional CNN-based methods require. However, directly extending transformers to oriented object detection presents three main issues: 1) objects rotate arbitrarily, necessitating the encoding of angles along with position and size; 2) the geometric relations of oriented objects are lacking in self-attention, due to the absence of interaction between content and positional queries; and 3) oriented objects cause misalignment, mainly between values and positional queries in cross-attention, making accurate classification and localization difficult. In this paper, we propose an end-to-end transformer-based oriented object detector consisting of three dedicated modules that address these issues. First, Gaussian positional encoding is proposed to encode the angle, position, and size of oriented boxes using Gaussian distributions. Second, Wasserstein self-attention is proposed to introduce geometric relations and facilitate interaction between content and positional queries by utilizing Gaussian Wasserstein distance scores. Third, oriented cross-attention is proposed to align values and positional queries by rotating sampling points around the positional query according to their angles. Experiments on six datasets (DIOR-R, a series of DOTA datasets, HRSC2016, and ICDAR2015) show the effectiveness of our approach. Compared with previous end-to-end detectors, OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0, respectively, while reducing training epochs from 3$\times$ to 1$\times$. The code is available at https://github.com/wokaikaixinxin/OrientedFormer.
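
To make the Gaussian modeling in the abstract concrete, the following is a minimal sketch of how an oriented box can be mapped to a 2-D Gaussian and how the squared Gaussian Wasserstein distance between two such boxes can be computed in closed form. It assumes the commonly used conversion $\mu = (x, y)$, $\Sigma = R\,\mathrm{diag}(w^2/4, h^2/4)\,R^\top$ with the angle in radians; the paper's exact parameterization, and how the resulting distance is turned into attention scores, may differ. The function names `box_to_gaussian` and `gwd_squared` are illustrative and are not taken from the released code.

```python
import math


def box_to_gaussian(cx, cy, w, h, theta):
    """Map an oriented box (cx, cy, w, h, theta) to a 2-D Gaussian.

    Assumption: Sigma = R diag(w^2/4, h^2/4) R^T, theta in radians.
    Returns the mean and the three unique covariance entries.
    """
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2.0) ** 2, (h / 2.0) ** 2
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)


def gwd_squared(g1, g2):
    """Squared 2-Wasserstein distance between two 2-D Gaussians.

    Uses the 2x2 identity Tr(M^(1/2)) = sqrt(Tr(M) + 2 sqrt(det M)),
    so no explicit matrix square root is needed.
    """
    (mx1, my1), (a1, b1, c1) = g1
    (mx2, my2), (a2, b2, c2) = g2
    center_term = (mx1 - mx2) ** 2 + (my1 - my2) ** 2
    trace_sum = (a1 + c1) + (a2 + c2)
    trace_prod = a1 * a2 + 2.0 * b1 * b2 + c1 * c2  # Tr(Sigma1 Sigma2)
    det_prod = (a1 * c1 - b1 * b1) * (a2 * c2 - b2 * b2)
    # Clamp at zero to guard against small negative rounding errors.
    cross = math.sqrt(max(trace_prod + 2.0 * math.sqrt(max(det_prod, 0.0)), 0.0))
    return center_term + max(trace_sum - 2.0 * cross, 0.0)


if __name__ == "__main__":
    box_a = box_to_gaussian(0.0, 0.0, 4.0, 2.0, 0.0)
    box_b = box_to_gaussian(0.0, 0.0, 4.0, 2.0, math.pi / 2)
    print(gwd_squared(box_a, box_b))
```

Under this conversion a box rotated by $\pi$ maps to the same Gaussian as the original box, so the distance depends on position, size, and orientation but not on the choice between equivalent angle values.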

Paper Structure

This paper contains 27 sections, 19 equations, 15 figures, 16 tables, 1 algorithm.

Figures (15)

  • Figure 1: (a) Object instances are distributed in remote sensing images with arbitrary orientations. Angles are used to characterize oriented objects, in addition to positions and sizes. (b) Visualization of sampling points of the oriented cross-attention for alignment.
  • Figure 2: Overall architecture of the OrientedFormer. Features are extracted from images. An object query is decomposed into a content query $Q_{c}$ and a positional query $Q_{p}$. The Gaussian PE encodes positional queries. The Wasserstein self-attention measures the geometric relations between two different content queries by utilizing Wasserstein distance scores. The oriented cross-attention is proposed to align values and positional queries.
  • Figure 3: An example of Gaussian positional encoding. (a) positional encoding in Deformable DETR. (b) Gaussian positional encoding.
  • Figure 4: Self-attention in the decoder. (a) Vanilla self-attention. (b) Wasserstein self-attention.
  • Figure 5: Oriented cross-attention. It attends to sparse sampling points $(\tilde{x},\tilde{y},\tilde{z})$ around the center of a positional query. Sampling points are rotated according to angles for alignment. Values $V$ are interpolated by sampling points and multi-scale features. We deploy attention mechanisms separately on each particular dimension of values, i.e., scale-aware, channel-aware, and spatial-aware.
  • ...and 10 more figures
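
The rotation of sampling points described in the Figure 5 caption can be illustrated with a small sketch. It covers only the in-plane part $(\tilde{x},\tilde{y})$ and leaves out the scale coordinate $\tilde{z}$ and the feature interpolation. The helper name `rotate_sampling_points`, the use of offsets normalized by box width and height, and the counter-clockwise angle convention in radians are assumptions made for illustration; in the actual model the offsets would be predicted rather than fixed.

```python
import math


def rotate_sampling_points(cx, cy, w, h, theta, offsets):
    """Place sampling points around a box center and rotate them by theta.

    offsets: iterable of (dx, dy) pairs normalized by box size (assumption),
    so (0.5, 0.0) lies on the middle of the box's right edge before rotation.
    """
    c, s = math.cos(theta), math.sin(theta)
    points = []
    for dx, dy in offsets:
        # Scale normalized offsets to the box size (illustrative choice).
        ox, oy = dx * w, dy * h
        # Standard 2-D rotation about the box center.
        points.append((cx + c * ox - s * oy, cy + s * ox + c * oy))
    return points


if __name__ == "__main__":
    grid = [(-0.25, 0.0), (0.0, 0.0), (0.25, 0.0)]
    for p in rotate_sampling_points(10.0, 20.0, 4.0, 2.0, math.pi / 6, grid):
        print(p)
```

With the angle set to zero the points stay axis-aligned, as in a horizontal-box detector; a non-zero angle makes them follow the box's long and short axes, which is the alignment effect the caption refers to.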