
Aligned Unsupervised Pretraining of Object Detectors with Self-training

Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos

TL;DR

AptDet addresses the misalignment between unsupervised detector pretraining and downstream class-aware detection by introducing an end-to-end framework that uses high-level semantic proposals, clustering-based pseudo-labels, and self-training to iteratively refine supervision. It replaces low-level, auxiliary objectives with a principled alignment between pretext and downstream tasks, enabling joint pretraining of the backbone and detector and even training from scratch on scene-centric data. Across DETR-based and R-CNN architectures, AptDet achieves state-of-the-art results on standard benchmarks, demonstrates strong performance in low-data and few-shot regimes, and shows versatility for self-supervised backbone pretraining. The approach simplifies the training pipeline, reduces reliance on complex auxiliary losses, and offers practical avenues for unsupervised representation learning directly from detection as a pretext task.

Abstract

The unsupervised pretraining of object detectors has recently become a key component of object detector training, as it leads to improved performance and faster convergence during the supervised fine-tuning stage. Existing unsupervised pretraining methods, however, typically rely on low-level information to define proposals that are used to train the detector. Furthermore, in the absence of class labels for these proposals, an auxiliary loss is used to add high-level semantics. This results in complex pipelines and a task gap between the pretraining and the downstream task. We propose a framework that mitigates this issue and consists of three simple yet key ingredients: (i) richer initial proposals that encode high-level semantics, (ii) class pseudo-labeling through clustering, which enables pretraining using a standard object detection training pipeline, and (iii) self-training to iteratively improve and enrich the object proposals. Once the pretraining and downstream tasks are aligned, a simple detection pipeline without further bells and whistles can be directly used for pretraining and, in fact, results in state-of-the-art performance in both the full- and low-data regimes, across detector architectures and datasets, by significant margins. We further show that our pretraining strategy is also capable of pretraining from scratch (including the backbone) and works on complex images like COCO, paving the path for unsupervised representation learning using object detection directly as a pretext task.
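
Ingredient (ii) above, class pseudo-labeling through clustering, can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or which features are used, so plain k-means over per-proposal feature vectors is an assumption here, and the function name `kmeans_pseudo_labels` and its parameters are hypothetical. The point is only that each proposal ends up with an integer pseudo-class, which lets a standard detection classification loss be applied without ground-truth labels.

```python
import random


def kmeans_pseudo_labels(features, k, iters=20, seed=0):
    """Assign each proposal feature vector a pseudo-class id in [0, k).

    Assumption: plain k-means; the paper's actual clustering may differ.
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct proposals.
    centroids = [list(f) for f in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, f in enumerate(features):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
            )
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy proposal features: two well-separated groups of three vectors each.
toy_features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
toy_labels = kmeans_pseudo_labels(toy_features, k=2)
```

In this toy setup the first three proposals share one pseudo-class and the last three share the other. Which integer each group receives depends on the initialisation and carries no meaning.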

Paper Structure

This paper contains 23 sections, 2 equations, 5 figures, 18 tables, and 1 algorithm.

Figures (5)

  • Figure 1: AptDet overview: (i) Object proposals are extracted from images in an unsupervised manner and assigned pseudo-labels via clustering; (ii) The pseudo-labeled object proposals are used to train the detector, which learns to localize objects and discriminate their pseudo-class label; (iii) The detector then generates a new set of proposals and pseudo-labels, which are used for self-training.
  • Figure 2: Overview of AptDet's pretraining Stage 1 for a DETR-based detector. Pseudo-labeled region proposals are extracted at the start of training, leveraging a self-supervised pretrained backbone. Those proposals are then used to train the detector to both localize objects within the image, and to discriminate their pseudo-labels.
  • Figure 3: AP scores on COCO's val2014 novel classes during finetuning with k=10 instances per class. Results averaged over 5 runs.
  • Figure 4: AP scores on COCO's val2014 novel classes during finetuning with k=30 instances per class. Results averaged over 5 runs.
  • Figure 5: Examples of object proposals produced by AptDet, contrasted with the ground truth, Selective Search, and our initial pseudo-labeled object proposals, extracted as described in Sec. 3.1 of the paper. The images belong to COCO train2017. To avoid clutter, we only show predicted objects whose bounding boxes have an IoU greater than 0.5 with at least one ground truth object. Best seen in color.
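
The display filter described in the Figure 5 caption (keep a predicted box only if its IoU with at least one ground truth box exceeds 0.5) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the `(x1, y1, x2, y2)` corner format and the function names `iou` and `keep_matched` are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_matched(preds, gts, thr=0.5):
    """Keep predictions whose IoU with at least one ground truth box exceeds thr."""
    return [p for p in preds if any(iou(p, g) > thr for g in gts)]


gts = [(0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (5, 5, 15, 15), (1, 1, 10, 10)]
shown = keep_matched(preds, gts)
```

Here the second prediction overlaps the ground truth with an IoU of 25/175, about 0.14, so it is dropped. The other two (IoU 1.0 and 0.81) are kept.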