Rethinking Annotation for Object Detection: Is Annotating Small-size Instances Worth Its Cost?

Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani

TL;DR

Annotating small-size object instances is costly and error-prone. The authors test whether a detector trained without small-size annotations can still detect small objects, using either upscaling at test time (Up@Test) or downscaling at training time (Down@Train) on COCO, and then distill Up@Test into a single-path model. Up@Test, combined with a remedy that narrows the domain gap between training and test inputs, matches or exceeds a baseline trained on the complete data, while Down@Train underperforms because of domain gaps. The distillation step yields a practical single-path detector with comparable performance across size ranges. These findings suggest that annotating small objects may be less essential than commonly believed: input resizing and knowledge distillation could reduce labeling costs while preserving detection accuracy.

Abstract

Detecting objects occupying only small areas in an image is difficult, even for humans. Therefore, annotating small-size object instances is hard and thus costly. This study questions common sense by asking the following: is annotating small-size instances worth its cost? We restate it as the following verifiable question: can we detect small-size instances with a detector trained using training data free of small-size instances? We evaluate a method that upscales input images at test time and a method that downscales images at training time. The experiments conducted using the COCO dataset show the following. The first method, together with a remedy to narrow the domain gap between training and test inputs, achieves at least comparable performance to the baseline detector trained using complete training data. Although the method needs to apply the same detector twice to an input image with different scaling, we show that its distillation yields a single-path detector that performs as well as the same baseline detector. These results point to the necessity of rethinking the annotation of training data for object detection.
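The two-pass inference described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `detector` and `upscale` are hypothetical callables supplied by the caller, the default `factor` of 2.0 is an arbitrary choice, and the rule of keeping only sub-threshold boxes from the upscaled pass is an assumption. The only external fact used is COCO's convention that an instance is "small" when its area is below 32 × 32 pixels.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
Detection = Tuple[Box, float, int]        # (box, score, class_id)

# COCO counts an instance as "small" when its area is below 32*32 pixels.
SMALL_AREA = 32 * 32


def box_area(box: Box) -> float:
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def up_at_test(
    image: Any,
    detector: Callable[[Any], List[Detection]],   # hypothetical detector
    upscale: Callable[[Any, float], Any],         # hypothetical resize helper
    factor: float = 2.0,                          # arbitrary illustrative value
) -> List[Detection]:
    """Run the same detector twice and merge the results by object size."""
    # Pass 1: native resolution. Keep only medium- and large-size detections.
    native = [d for d in detector(image) if box_area(d[0]) >= SMALL_AREA]

    # Pass 2: upscaled input. Map boxes back to original coordinates and keep
    # only those that are small in the original image (assumed merge rule).
    small: List[Detection] = []
    for (x1, y1, x2, y2), score, cls in detector(upscale(image, factor)):
        mapped = (x1 / factor, y1 / factor, x2 / factor, y2 / factor)
        if box_area(mapped) < SMALL_AREA:
            small.append((mapped, score, cls))

    return native + small
```

The sketch makes the cost mentioned in the abstract visible: the detector is invoked twice per image, which is what the distilled single-path model avoids.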

Paper Structure

This paper contains 23 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Compared with mid- and large-size object instances (upper row), small-size instances are harder to annotate correctly (lower row).
  • Figure 2: Examples showing difficulties with annotating small-size object instances. (a) Images that are hard to annotate; while careful inspection reveals several BBs are missing, some are really hard to judge. Only small-size instances are visualized. (b) In the COCO dataset, annotators can specify bounding boxes with a binary label "iscrowd" (shown in green) to enclose a crowd of objects; they are excluded from both training and evaluation.
  • Figure 3: "Accuracy" of a human subject's annotation on 500 selected images from the COCO validation split, plotted as colored dots. The precision-recall curves of a SOTA object detector, EfficientDet-D7, are shown for comparison. Evaluation uses two definitions of correct detection (IoU thresholds of $0.5$ and $0.7$).
  • Figure 4: Three methods compared in the experiments. (a) The ordinary method. A model is trained with and applied to images at the native resolution for instances of all sizes. (b) Up@Test: upscaling input images at test time. A model is trained only with medium- and large-size instances. It detects small-size instances from upscaled images and others from the original images. (c) Down@Train: downscaling input images at training time. A model is trained with images at the native resolution to detect medium- and large-size instances and trained with downscaled images to detect small-size instances. It detects instances of all sizes from input images at the native resolution.
  • Figure 5: Results of Up@Test-vanilla, i.e., trained with the original images and with only medium- and large-size instances. The left and right show Faster RCNN and FCOS, respectively. The horizontal axis indicates the factor of the test-time upscaling. The broken lines indicate Baseline.
  • ...and 4 more figures
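
The two definitions of correct detection in the Figure 3 caption differ only in the IoU threshold. As a rough illustration (a sketch with hypothetical helper names `iou` and `is_correct`; full COCO-style evaluation additionally requires class agreement and one-to-one matching, which are omitted here), the same predicted box can count as correct under one threshold and incorrect under the other:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Box, gt: Box, threshold: float) -> bool:
    """Simplified check: a prediction is correct if its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold


gt = (0.0, 0.0, 10.0, 10.0)
pred = (0.0, 0.0, 10.0, 6.0)   # covers 60% of the ground-truth box, IoU = 0.6

print(is_correct(pred, gt, 0.5))  # True
print(is_correct(pred, gt, 0.7))  # False
```

For small instances a shift of a few pixels changes the IoU substantially, which is one reason the stricter threshold is harder to satisfy for both human annotators and detectors.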