Tiny Object Detection with Single Point Supervision

Haoran Zhu, Chang Xu, Ruixiang Zhang, Fang Xu, Wen Yang, Haijian Zhang, Gui-Song Xia

TL;DR

This work tackles the high-cost challenge of bounding-box annotation for tiny objects by enabling end-to-end detection with single-point supervision. It introduces Point Teacher, a two-phase denoising framework that converts noisy point annotations into accurate pseudo boxes through Spatial-aware Box Generation and Noise-aware Label Evolution, aided by Dynamic Multiple Instance Learning and a Jittering IoU Loss. The method uses a teacher-student setup and a top-K point-matching mechanism to progressively refine boxes, and it integrates with detectors via a Top-down FPN Aggregation and Scale-invariant Label Assignment, supporting both horizontal and oriented boxes. Experiments on AI-TOD-v2.0, SODA-A, and TinyPerson show substantial gains over existing PSOD approaches and competitive performance with box-supervised methods, highlighting robustness to point-location noise and potential for large-scale, annotation-efficient tiny object detection in aerial imagery.

Abstract

Tiny objects, with their limited spatial resolution, often resemble point-like distributions. As a result, bounding box prediction using point-level supervision emerges as a natural and cost-effective alternative to traditional box-level supervision. However, the small scale and lack of distinctive features of tiny objects make point annotations prone to noise, posing significant hurdles for model robustness. To tackle these challenges, we propose Point Teacher, the first end-to-end point-supervised method for robust tiny object detection in aerial images. To handle label noise from scale ambiguity and location shifts in point annotations, Point Teacher employs a teacher-student architecture and decouples learning into a two-phase denoising process. In this framework, the teacher network progressively denoises the pseudo boxes derived from noisy point annotations, guiding the student network's learning. Specifically, in the first phase, random masking of image regions facilitates regression learning, enabling the teacher to transform noisy point annotations into coarse pseudo boxes. In the second phase, these coarse pseudo boxes are refined using dynamic multiple instance learning, which adaptively selects the most reliable instance from dynamically constructed proposal bags around the coarse pseudo boxes. Extensive experiments on three tiny object datasets (i.e., AI-TOD-v2, SODA-A, and TinyPerson) validate the proposed method's effectiveness and robustness against point location shifts. Notably, relying solely on point supervision, Point Teacher already performs comparably to box-supervised learning methods. Code and models will be made publicly available.
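
The first-phase idea of randomly masking image regions can be illustrated with a small sketch. This is not the authors' implementation: the function name `random_mask`, the square patch shape, the zero fill value, and the parameters `num_patches` and `patch` are all assumptions chosen for illustration, and the image is a plain nested list rather than a tensor.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def random_mask(image: Image, num_patches: int, patch: int, seed: int = 0) -> Image:
    """Return a copy of `image` with `num_patches` square regions set to 0.

    Hypothetical helper: patch shape, count, and fill value are assumptions,
    not values taken from the paper.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input stays untouched
    for _ in range(num_patches):
        # Pick a top-left corner such that the patch stays inside the image.
        y0 = rng.randint(0, max(0, h - patch))
        x0 = rng.randint(0, max(0, w - patch))
        for y in range(y0, min(h, y0 + patch)):
            for x in range(x0, min(w, x0 + patch)):
                out[y][x] = 0.0
    return out


image = [[1.0] * 8 for _ in range(8)]
masked = random_mask(image, num_patches=3, patch=2, seed=42)
```

In a training loop, a masked copy like `masked` would be fed to the regression branch in place of the original image; how the masked regions relate to the annotated points is not specified in this section and is left out of the sketch.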

Paper Structure

This paper contains 17 sections, 15 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: (a) Effect of point location on accuracy: previous methods assume that the point lies within the center region, and performance degrades significantly when the point shifts slightly away from the center. (b) Comparison of point annotations for large and tiny objects: the limited scale and ambiguous boundaries make it challenging to annotate accurately on the main body of the tiny object. (c) An overview of our proposed Point Teacher: we propose a one-step, two-phase learning paradigm that is robust to point location, consisting of Spatial-aware Box Generation and Noise-aware Label Evolution.
  • Figure 2: A comparison with existing point-supervised object detection methods, including (a) MIL-based methods; (b) CPM-based methods; (c) Auxiliary-based methods; (d) Denoising-based methods. Paradigms (a), (b), and (c) adopt a two-step, non-end-to-end training process, whereas paradigm (d) adopts a one-step, two-phase end-to-end training process. SAM denotes Segment Anything Model.
  • Figure 3: The framework of Point Teacher. The training process of Point Teacher consists of two phases: Spatial-aware Box Generation (phase 1) and Noise-aware Label Evolution (phase 2). During the Spatial-aware Box Generation phase, the masked image is used to train both the regression branch and the DMIL module, enabling the model to develop spatial awareness. In the Noise-aware Label Evolution phase, the teacher network, in conjunction with the DMIL module, generates clean pseudo boxes to supervise the student network for end-to-end learning. Classification learning is integrated throughout both phases.
  • Figure 4: The workflow of Dynamic Multiple Instance Learning Module (DMIL). DMIL comprises four stages: Bag Construction, Bag Extension, Bag Classifier, and Instance Selection. The Bag Construction and Bag Extension stages ensure the creation of high-quality bags. The Bag Classifier is used to train the classification and discrimination capabilities of the bags. Finally, Instance Selection uses the scores from the Bag Classifier to merge instances, generating pseudo boxes.
  • Figure 5: Visualization of pseudo boxes generated by the DMIL module. Green boxes denote the ground-truth (GT) boxes; yellow boxes denote the pseudo boxes generated by DMIL.
  • ...and 1 more figure
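
The DMIL workflow described in the Figure 4 caption (build a bag of candidate boxes, score them, merge instances by score) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's module: the jitter grid in `build_bag`, the top-k score-weighted average in `merge_top_k`, and the hand-written `toy_score` (a stand-in for the learned Bag Classifier) are all hypothetical, and the Bag Extension stage is omitted.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h)


def build_bag(box: Box,
              shifts=(-1.0, 0.0, 1.0),
              scales=(0.8, 1.0, 1.2)) -> List[Box]:
    """Bag construction (assumed): jitter the coarse box in position and scale."""
    cx, cy, w, h = box
    return [(cx + dx, cy + dy, w * s, h * s)
            for dx in shifts for dy in shifts for s in scales]


def l1(a: Box, b: Box) -> float:
    """Sum of absolute differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def toy_score(candidate: Box, reference: Box) -> float:
    """Stand-in for the learned Bag Classifier: higher when closer to `reference`."""
    return 1.0 / (1.0 + l1(candidate, reference))


def merge_top_k(bag: List[Box], scores: List[float], k: int = 3) -> Box:
    """Instance selection (assumed): score-weighted average of the k best boxes."""
    ranked = sorted(zip(scores, bag), key=lambda t: t[0], reverse=True)[:k]
    total = sum(s for s, _ in ranked)
    merged = [sum(s * b[i] for s, b in ranked) / total for i in range(4)]
    return (merged[0], merged[1], merged[2], merged[3])


coarse: Box = (10.0, 10.0, 4.0, 4.0)   # coarse pseudo box from phase 1
target: Box = (11.0, 10.0, 4.8, 4.8)   # hypothetical "true" box, used only by toy_score
bag = build_bag(coarse)
scores = [toy_score(b, target) for b in bag]
refined = merge_top_k(bag, scores, k=3)
```

With a scorer that prefers boxes near the true object, the merged box `refined` ends up closer to `target` than the coarse box was; in the actual method that scoring signal comes from the trained Bag Classifier rather than from a known reference box.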