Rethinking Annotation for Object Detection: Is Annotating Small-size Instances Worth Its Cost?
Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani
TL;DR
Annotating small-size object instances is costly and error-prone. The authors test whether a detector trained without small-size annotations can still detect small objects, using either upscaling at test time (Up@Test) or downscaling at training time (Down@Train) on COCO, and they further distill Up@Test into a single-path model. Up@Test with domain-gap mitigation matches or exceeds the baseline trained on the full data, while Down@Train underperforms because of the domain gap; the distillation step yields a practical single-path detector with comparable performance across size ranges. These findings suggest that annotating small objects may be less essential than commonly believed, which could reduce labeling cost while preserving detection accuracy.
Abstract
Detecting objects that occupy only small areas in an image is difficult, even for humans. Annotating small-size object instances is therefore hard and costly. This study questions the conventional wisdom by asking: is annotating small-size instances worth its cost? We restate it as the following verifiable question: can we detect small-size instances with a detector trained on data free of small-size instances? We evaluate a method that upscales input images at test time and a method that downscales images at training time. Experiments on the COCO dataset show the following. The first method, together with a remedy to narrow the domain gap between training and test inputs, achieves performance at least comparable to that of the baseline detector trained on the complete training data. Although the method needs to apply the same detector twice to an input image at different scales, we show that its distillation yields a single-path detector that performs as well as the same baseline detector. These results point to the necessity of rethinking the annotation of training data for object detection.
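The setup described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes COCO-style `(x, y, width, height)` boxes and uses the COCO convention that an object is "small" when its area is below 32 × 32 pixels (approximated here by box area, whereas COCO evaluation uses segmentation area). The function names, the 2× upscaling factor, and the rule for merging the two passes are hypothetical choices made for illustration only; the abstract does not specify them, and the domain-gap remedy is not modeled.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), COCO-style
Detection = Dict[str, Any]

# COCO's conventional upper bound on "small" object area, in pixels.
SMALL_AREA = 32 * 32


def box_area(box: Box) -> float:
    """Area of an (x, y, width, height) box."""
    return box[2] * box[3]


def drop_small_instances(annotations: List[Detection],
                         min_area: float = SMALL_AREA) -> List[Detection]:
    """Build training data free of small instances (box area < min_area)."""
    return [a for a in annotations if box_area(tuple(a["bbox"])) >= min_area]


def rescale_box(box: Box, factor: float) -> Box:
    """Scale all box coordinates by the same factor."""
    return tuple(v * factor for v in box)


def detect_two_pass(image: Any,
                    detector: Callable[[Any], List[Detection]],
                    upscale: Callable[[Any, float], Any],
                    factor: float = 2.0,
                    min_area: float = SMALL_AREA) -> List[Detection]:
    """Apply the same detector at two scales (hypothetical merge rule).

    Pass 1 runs on the original image and keeps non-small detections.
    Pass 2 runs on the upscaled image, maps boxes back to the original
    coordinates, and keeps only those that are small at original scale.
    """
    large = [d for d in detector(image)
             if box_area(tuple(d["bbox"])) >= min_area]
    small = []
    for d in detector(upscale(image, factor)):
        bbox = rescale_box(tuple(d["bbox"]), 1.0 / factor)
        if box_area(bbox) < min_area:
            small.append({**d, "bbox": bbox})
    return large + small
```

Here `detector` and `upscale` are placeholders for any detection model and any image-resizing routine. The distilled single-path detector mentioned in the abstract would replace the two calls to `detector` with one.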
