Wholly-WOOD: Wholly Leveraging Diversified-quality Labels for Weakly-supervised Oriented Object Detection

Yi Yu, Xue Yang, Yansheng Li, Zhenjun Han, Feipeng Da, Junchi Yan

TL;DR

Wholly-WOOD tackles the high cost of rotated bounding box annotations by unifying Point, HBox, and RBox supervision into a single weakly-supervised framework for oriented object detection. It combines symmetry-aware learning to infer orientation with a Point-to-RBox knowledge-aggregation module, producing accurate RBoxes from diverse labels. Key contributions include a theory of symmetry-based angle estimation, H2RBox-v2 and Point2RBox extensions, and the integrated Wholly-WOOD system with a P2R subnet, achieving near-parity with fully supervised baselines while reducing labeling effort. The approach demonstrates strong results on remote-sensing datasets and shows potential for broader applicability, with open-source PyTorch/Jittor implementations provided.

Abstract

Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To equip detectors with orientation awareness, supervised regression/classification modules have been introduced at the high cost of rotation annotation. Meanwhile, some existing datasets with oriented objects are already annotated with horizontal boxes or even single points. Effectively utilizing these weaker single-point and horizontal annotations to train an oriented object detector (OOD) is attractive, yet it remains an open problem. We develop Wholly-WOOD, a weakly-supervised OOD framework capable of wholly leveraging various labeling forms (Points, HBoxes, RBoxes, and their combination) in a unified fashion. Using only HBoxes for training, Wholly-WOOD achieves performance very close to that of its RBox-trained counterpart in remote sensing and other areas, significantly reducing the labor-intensive annotation effort for oriented objects. The source code is available at https://github.com/VisionXLab/whollywood (PyTorch-based) and https://github.com/VisionXLab/whollywood-jittor (Jittor-based).

Paper Structure

This paper contains 16 sections, 27 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The task we address and the results we achieve. (a) Different annotation formats supported by Wholly-WOOD. (b) Accuracy for each category of remote sensing objects in the DOTA-v1.0 dataset. (c) Green/Orange/Blue bars: the accuracy of Wholly-WOOD using Point/HBox/RBox annotations. Red dashed lines: the accuracy of RBox-trained FCOS (Tian et al., 2019), serving as a reference to measure the accuracy disparity.
  • Figure 2: The overview of H2RBox-v2. (a) Self-supervised (SS) branch that learns the orientation from the symmetry of objects. (b) Weakly-supervised branch that learns other properties from HBoxes. (c) Snap loss for the SS branch. (d) Circumscribed IoU (CircumIoU) loss for the WS branch.
  • Figure 3: The training flowchart of Point2RBox, consisting of knowledge combination and transform self-supervision. The core idea is to combine knowledge from synthetic patterns for size and angle estimation, and knowledge from annotated points for classification.
  • Figure 4: The Wholly-WOOD architecture. It features two key components: 1) The symmetry-aware learning module utilizes self-supervised learning to extract object orientations based on symmetry; 2) The knowledge combination module integrates a pattern generator, which generates synthetic visual patterns for training the size and angle regression of the P2R subnet. The predictions from the P2R subnet then offer RBox suggestions corresponding to each point.
  • Figure 5: The architecture of P2R subnet. The multiple output layers of the FPN automatically aggregate based on the gating score activated by each layer itself. Meanwhile, the output boxes are scaled by a factor calculated from the gating score. bin2dec($\cdot$) denotes continuous binary-to-decimal (see Eq. \ref{eq:b2d}).
  • ...and 4 more figures
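
The relationship between the RBox and HBox label formats mentioned above can be illustrated with a small geometric sketch. This is not the authors' CircumIoU loss or any code from the Wholly-WOOD repositories; it only shows the standard geometry of the horizontal box that circumscribes a rotated box. The function name `rbox_to_hbox`, the `(cx, cy, w, h, theta)` parameterization, and the use of radians are assumptions made for illustration.

```python
import math


def rbox_to_hbox(cx, cy, w, h, theta):
    """Return (x_min, y_min, x_max, y_max) of the horizontal box that
    circumscribes a rotated box.

    Assumed convention (hypothetical, for illustration only): the rotated
    box is centered at (cx, cy), has side lengths w and h, and is rotated
    by theta radians about its center.
    """
    cos_t = abs(math.cos(theta))
    sin_t = abs(math.sin(theta))
    # Half-extents of the rotated rectangle projected onto the x and y axes.
    half_w = 0.5 * (w * cos_t + h * sin_t)
    half_h = 0.5 * (w * sin_t + h * cos_t)
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


if __name__ == "__main__":
    # An unrotated box is its own circumscribed horizontal box.
    print(rbox_to_hbox(0.0, 0.0, 4.0, 2.0, 0.0))
    # Rotating by 90 degrees swaps the horizontal and vertical extents.
    print(rbox_to_hbox(0.0, 0.0, 4.0, 2.0, math.pi / 2))
```

Because many different rotated boxes share the same circumscribed horizontal box, an HBox label alone does not determine the angle, which is why the framework relies on an additional cue such as object symmetry to recover orientation.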