Label-Efficient Object Detection via Region Proposal Network Pre-Training

Nanqing Dong, Linus Ericsson, Yongxin Yang, Ales Leonardis, Steven McDonagh

TL;DR

This work addresses the localization bottleneck in self-supervised object detection by pre-training the region proposal network (RPN) on unsupervised region proposals and aligning detector-head pre-training with the RPN. The proposed ADePT framework enables end-to-end self-supervised pre-training of a two-stage detector, with separate and joint training strategies and BYOL-style contrastive objectives. Results on COCO, SODA10M, and PASCAL VOC show that RPN pre-training reduces localization errors and improves downstream performance, most clearly in label-scarce and few-shot scenarios, with strong performance under challenging domain shifts. The findings highlight the value of aligning pretext tasks with localization components. They suggest practical benefits for label-efficient object detection and point to future extensions that improve unsupervised proposals and instance segmentation.

Abstract

Self-supervised pre-training, based on the pretext task of instance discrimination, has fueled recent advances in label-efficient object detection. However, existing studies focus on pre-training only a feature extractor network to learn transferable representations for downstream detection tasks. As a result, multiple detection-specific modules must be trained from scratch in the fine-tuning phase. We argue that the region proposal network (RPN), a common detection-specific module, can additionally be pre-trained to reduce the localization error of multi-stage detectors. In this work, we propose a simple pretext task that provides effective pre-training for the RPN and efficiently improves downstream object detection performance. We evaluate the efficacy of our approach on benchmark object detection tasks and additional downstream tasks, including instance segmentation and few-shot detection. Compared with multi-stage detectors without RPN pre-training, our approach consistently improves downstream task performance, with the largest gains found in label-scarce settings.

Paper Structure (27 sections, 6 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Localization errors contribute significantly to overall detection error rates. The effect is observed by evaluating recent SSL approaches BYOL grill2020bootstrap, SwAV caron2020unsupervised, SoCo wei2021aligning. Our ADePT method reduces the dominant localization error term. See text for further details.
  • Figure 2: Qualitative comparison using SODA10M han2021soda10m. Top row: ADePT is the only method able to detect the partially occluded truck (image left). Bottom row: dark background and low illumination make successful detection challenging. BYOL grill2020bootstrap and SwAV caron2020unsupervised fail to detect all vehicles. SoCo wei2021aligning and ADePT successfully capture all three, yet SoCo hallucinates an object on the right-hand side. Best viewed with digital zoom.
  • Figure 3: Diagram of ADePT. We perform self-supervised pre-training of the RPN component (Sec. \ref{sec:method:rpn}, \ref{sec:method:ssl_rpn}). The feature extractor (backbone + FPN) and detector head are also trained in a self-supervised fashion: three augmented views of the same region in the original image are passed through an online branch (blue) and a target branch (purple) to form a contrastive detector loss.
  • Figure 4: Qualitative comparison on MS COCO lin2014microsoft. Top: our models have the highest prediction confidence, and ADePT (joint) additionally detects a distant person despite difficult illumination. Bottom: all models capture the large foreground stop sign. We further detect the "person" object, a small human-face image attached to the sign. RPN pre-training can improve small object localization.
  • Figure 5: Predictions using MS COCO lin2014microsoft. BYOL grill2020bootstrap fails to detect a sheep (image left). SwAV caron2020unsupervised overcounts sheep by incorrectly predicting a (single) sheep as multiple instances (image middle). SoCo wei2021aligning fails to detect a sheep (image middle) and incorrectly splits a single sheep in two (image right). ADePT models detect all instances with high confidence. Best viewed with digital zoom.
  • ...and 2 more figures
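
The online and target branches mentioned in the Figure 3 caption, together with the BYOL-style objective named in the TL;DR, can be illustrated with a minimal sketch. It shows the standard BYOL formulation (a normalized regression loss equal to 2 - 2 * cosine similarity, plus an exponential-moving-average update of the target parameters). Whether ADePT uses exactly this form is not stated in this section, so the function names, the flat parameter lists, and the momentum value of 0.99 are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def byol_style_loss(online_pred, target_proj):
    """BYOL-style loss for one region: 2 - 2 * cosine similarity.

    online_pred: embedding produced by the online branch (hypothetical name).
    target_proj: embedding produced by the target branch (hypothetical name).
    """
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    cosine = sum(a * b for a, b in zip(p, z))
    return 2.0 - 2.0 * cosine


def ema_update(target_params, online_params, momentum=0.99):
    """Move target parameters towards the online ones; no gradients are used.

    The momentum value is an assumed placeholder, not a setting from the paper.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    # Two views of the same region that agree in direction give a loss near 0.
    print(byol_style_loss([1.0, 0.0], [2.0, 0.0]))
    # Orthogonal embeddings give a loss of 2.
    print(byol_style_loss([1.0, 0.0], [0.0, 3.0]))
    # The target parameter drifts slowly towards the online parameter.
    print(ema_update([1.0], [0.0], momentum=0.9))
```

In this formulation only the online branch would receive gradients from the loss, while the target branch is refreshed through `ema_update` after each step. This is the usual BYOL design choice for avoiding collapse without negative pairs.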