Spatial Transform Decoupling for Oriented Object Detection
Hongtian Yu, Yunjie Tian, Qixiang Ye, Yunfan Liu
TL;DR
Spatial Transform Decoupling (STD) addresses oriented object detection with Vision Transformers by splitting bounding-box regression into separate branches for position ($x$, $y$), size ($w$, $h$), and angle ($\alpha$), and by applying Cascaded Activation Masks (CAMs), computed from the regressed parameters, to progressively refine RoI features. The branches are attached to stacked ViT blocks in a layer-wise manner, and the method reports state-of-the-art results on the remote-sensing benchmarks DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP). Key contributions are the multi-branch parameter prediction design, CAM-enhanced self-attention (TBAM), and ablations that examine the design choices and the generalizability to different backbones and detectors. The Appendix provides a detailed spatial-transform derivation and supplementary experiments.
Abstract
Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, a limitation that may stem from the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), which provides a simple yet effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD uses separate network branches to predict the position, size, and angle of bounding boxes, harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed from the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP). Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
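The two ideas in the abstract, decoupled parameter branches and an activation mask derived from the regressed box, can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names (`decoupled_heads`, `rotated_box_mask`, `apply_mask`), the linear heads, the `exp` and `tanh` output mappings, and the hard binary mask are all assumptions made for illustration only.

```python
import numpy as np

def decoupled_heads(roi_vec, w_pos, w_size, w_angle):
    """Predict position, size and angle with three independent linear heads.

    Hypothetical stand-in for separate network branches; the real model
    attaches its branches to ViT blocks rather than to one pooled vector.
    """
    x, y = w_pos @ roi_vec                      # position branch
    w, h = np.exp(w_size @ roi_vec)             # size branch, exp keeps sizes positive
    # angle branch, tanh squashes the output into [-pi/2, pi/2]
    alpha = float(np.tanh(w_angle @ roi_vec)[0]) * np.pi / 2
    return float(x), float(y), float(w), float(h), alpha

def rotated_box_mask(x, y, w, h, alpha, size):
    """Binary mask (size x size) that is 1 inside the oriented box."""
    coords = np.arange(size) + 0.5              # pixel centres
    ys, xs = np.meshgrid(coords, coords, indexing="ij")
    dx, dy = xs - x, ys - y
    # Rotate pixel offsets into the box's own frame.
    u = dx * np.cos(alpha) + dy * np.sin(alpha)
    v = -dx * np.sin(alpha) + dy * np.cos(alpha)
    inside = (np.abs(u) <= w / 2) & (np.abs(v) <= h / 2)
    return inside.astype(np.float32)

def apply_mask(features, mask):
    """Keep RoI features (C, H, W) inside the mask and zero the rest."""
    return features * mask[None, :, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    roi_vec = rng.standard_normal(16)
    params = decoupled_heads(
        roi_vec,
        rng.standard_normal((2, 16)),
        rng.standard_normal((2, 16)),
        rng.standard_normal((1, 16)),
    )
    print("regressed (x, y, w, h, alpha):", params)
    mask = rotated_box_mask(4, 4, 4, 2, np.pi / 6, size=8)
    feats = rng.standard_normal((3, 8, 8)).astype(np.float32)
    print("masked feature shape:", apply_mask(feats, mask).shape)
```

In a cascaded setting, one would recompute the mask each time a branch refines its parameters and reapply it to the features passed to the next stage; the sketch shows only a single step of that loop.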
