Spatial Transform Decoupling for Oriented Object Detection

Hongtian Yu, Yunjie Tian, Qixiang Ye, Yunfan Liu

TL;DR

Spatial Transform Decoupling (STD) tackles oriented object detection with Vision Transformers by decoupling bounding-box parameter estimation into separate branches for $x$, $y$, $w$, $h$, and $\alpha$, and by applying Cascaded Activation Masks (CAMs) to progressively refine RoI features. The method integrates with ViT-based detectors in a layer-wise, hierarchical manner and achieves state-of-the-art performance on the remote-sensing benchmarks DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP). Key contributions include the multi-branch parameter prediction design, Transformer blocks integrated with activation masks (TBAM), and extensive ablations that support the design choices and show generalization to different backbones and detectors. A detailed spatial-transform derivation and supplementary experiments are provided in the Appendix.

Abstract

Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
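
To make the idea of an activation mask computed from regressed box parameters more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a box is described by a centre $(x, y)$, a size $(w, h)$, and an angle $\alpha$ in radians, and it marks which cells of a feature grid fall inside that oriented box. The function name `oriented_box_mask`, the binary (0/1) mask values, and the convention that cell centres sit at integer coordinates are all assumptions made for illustration; the paper's actual mask construction may differ.

```python
import math


def oriented_box_mask(x, y, w, h, alpha, height, width):
    """Return a height x width grid of 0/1 flags for cells inside an oriented box.

    Hypothetical helper for illustration only: (x, y) is the box centre,
    (w, h) its size, and alpha its rotation in radians.
    """
    cos_a, sin_a = math.cos(alpha), math.sin(alpha)
    mask = []
    for row in range(height):
        mask_row = []
        for col in range(width):
            # Offset of this cell from the box centre.
            dx, dy = col - x, row - y
            # Rotate the offset by -alpha to express it in the box's own frame.
            u = dx * cos_a + dy * sin_a
            v = -dx * sin_a + dy * cos_a
            inside = abs(u) <= w / 2 and abs(v) <= h / 2
            mask_row.append(1 if inside else 0)
        mask.append(mask_row)
    return mask


# The same 3 x 1 box, first axis-aligned and then rotated by 90 degrees.
flat = oriented_box_mask(2, 2, 3, 1, 0.0, 5, 5)
turned = oriented_box_mask(2, 2, 3, 1, math.pi / 2, 5, 5)
```

In this sketch, `flat` activates three cells along the middle row, while `turned` activates three cells along the middle column. A mask of this kind could then be used to weight RoI features; how STD aggregates masks across stages is described in the paper itself.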

Paper Structure (31 sections, 5 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Conventional approaches (upper) estimate the position, size, and angle using a single RoI feature. In contrast, STD (lower) predicts and refines the parameters of bounding boxes in a divide-and-conquer (decoupled) manner.
  • Figure 2: The framework of the proposed Spatial Transform Decoupling (STD) method. The detailed structure of Transformer blocks integrated with activation masks (TBAM) is shown on the left.
  • Figure 3: The translation between the predicted bounding box and the activation mask after affine transformation. The blue box represents the proposal region and the red box represents the activation mask.
  • Figure 4: Visualization of attention maps. Compared to the baseline Transformer, the attention maps in STD (bk1 to bk4) exhibit a stronger alignment with the semantic interpretation of the parameter estimated at the respective stage.
  • Figure 5: Comparison of detection results. STD demonstrates superior performance in reducing false detections ((a), (b), and (c)), better discerning clustered objects ((c) and (e)), and improving the alignment with oriented objects ((c), (d), and (e)).
  • ...and 2 more figures