FRED: Towards a Full Rotation-Equivariance in Aerial Image Object Detection

Chanho Lee, Jinsu Son, Hyounguk Shon, Yunho Jeon, Junmo Kim

TL;DR

FRED tackles the challenge of rotation-equivariance in aerial oriented object detection by enforcing end-to-end rotation-equivariance through a point-set bounding-box representation and two specialized deformable convolutions. The approach combines Rotation-Equivariant Deformable Convolution (RE-DCN) for localization and Rotation-Invariant DCN (RI-DCN) for classification, guided by a rotation-aligned reference vector and an edge-based stabilization loss. Empirically, FRED achieves competitive results with substantially fewer parameters on DOTA-v1.0 and outperforms state-of-the-art anchor-free methods on DOTA-v1.5, while maintaining robust performance under image rotations. These results suggest meaningful progress toward true non-axis-aligned learning and indicate potential for unsupervised pose estimation through rotation-equivariant representations.

Abstract

Rotation-equivariance is an essential yet challenging property in oriented object detection. While general object detectors naturally gain robustness to spatial shifts from the translation-equivariance of conventional CNNs, achieving rotation-equivariance remains an elusive goal. Current detectors deploy various alignment techniques to derive rotation-invariant features, but still rely on high-capacity models and heavy data augmentation with all possible rotations. In this paper, we introduce a Fully Rotation-Equivariant Oriented Object Detector (FRED), whose entire process from the image to the bounding box prediction is strictly equivariant. Specifically, we decouple the invariant task (object classification) and the equivariant task (object localization) to achieve end-to-end equivariance. We represent the bounding box as a set of rotation-equivariant vectors to implement rotation-equivariant localization. Moreover, we utilize these rotation-equivariant vectors as offsets in the deformable convolution, thereby enhancing the existing advantages of spatial adaptation. Leveraging full rotation-equivariance, FRED demonstrates higher robustness to image-level rotation than existing methods. Furthermore, our experiments show that FRED is one step closer to non-axis-aligned learning. Compared to state-of-the-art methods, our proposed method delivers comparable performance on DOTA-v1.0 and outperforms them by 1.5 mAP on DOTA-v1.5, all while significantly reducing the model parameters to 16%.
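
To make the idea of a box represented by rotation-equivariant vectors concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical decoder, `decode_box`, that simply adds a set of offset vectors to a center point; the names `rotate`, `decode_box`, `axis_aligned_extent`, and the sample coordinates are all illustrative assumptions. The sketch checks that decoding and then rotating gives the same points as rotating the inputs and then decoding, and contrasts this with an axis-aligned width and height, which change under rotation.

```python
import math


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = v
    return (c * x - s * y, s * x + c * y)


def decode_box(center, offsets):
    """Hypothetical decoder: absolute box points = center + offset vectors."""
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets]


def axis_aligned_extent(points):
    """Width and height of the axis-aligned box that encloses `points`."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs), max(ys) - min(ys))


def points_close(points_a, points_b, tol=1e-9):
    """True if two equally ordered point lists match coordinate by coordinate."""
    return all(
        math.isclose(a, b, abs_tol=tol)
        for pa, pb in zip(points_a, points_b)
        for a, b in zip(pa, pb)
    )


# Illustrative values only (not taken from the paper).
center = (10.0, 5.0)
offsets = [(4.0, 1.0), (-4.0, 1.0), (-4.0, -1.0), (4.0, -1.0)]
angle = math.radians(37.0)

# Path 1: decode the box, then rotate the decoded points.
decoded_then_rotated = [rotate(p, angle) for p in decode_box(center, offsets)]

# Path 2: rotate the center and the offset vectors, then decode.
rotated_then_decoded = decode_box(
    rotate(center, angle), [rotate(o, angle) for o in offsets]
)

# Equivariance: both paths give the same point set.
equivariant = points_close(decoded_then_rotated, rotated_then_decoded)

# Contrast: axis-aligned extents are not preserved by the rotation.
extent_before = axis_aligned_extent(decode_box(center, offsets))
extent_after = axis_aligned_extent(decoded_then_rotated)
```

Because rotation is a linear map, it commutes with the vector addition in the decoder, which is why the two paths agree for any angle. A box described by axis-aligned width and height has no such property, which is one way to read the abstract's motivation for a vector-based representation.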

Paper Structure (18 sections, 7 equations, 6 figures, 4 tables)

This paper contains 18 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the fully rotation-equivariant object detector (FRED). FRED consists of a rotation-equivariant backbone which predicts a point set followed by two prediction branches -- (1) a rotation-equivariant box regression head and (2) a rotation-invariant classification head. We achieve end-to-end equivariance for object detection.
  • Figure 2: Overall model architecture of the proposed Fully Rotation-Equivariant Detector (FRED). $C_N$-equivariant features are fed into the rotation-equivariant head up to two deformable convolution blocks. The Rotation-Equivariant Deformable Convolution (RE-DCN) utilizes an initial point set as an offset and refines it through spatial adaptation without breaking rotation-equivariance. The Rotation-Invariant Deformable Convolution (RI-DCN) performs an orientation alignment to produce rotation-invariant features using an alignment reference vector sourced from the localization branch. As both the deformable offsets and the reference vector maintain rotation-equivariance, the classification branch achieves instance-level rotation-invariance.
  • Figure 3: For simplicity, this example uses the cyclic rotation group of order 4 ($C_4$) and a 2x2 deformable kernel. The deformable convolution (DCN) layer parameters are shared between rotation groups.
  • Figure 4: Robustness against rotation estimated with DOTA-v1.0. We compared the performance degradation of various models as the image rotates. Excluding discrete rotations at 90-degree intervals, the loss of rotated image information always results in a decreased mAP.
  • Figure 5: Examples of detection results using FRED on DOTA-v1.5.
  • ...and 1 more figures