Aligned Unsupervised Pretraining of Object Detectors with Self-training
Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos
TL;DR
AptDet addresses the misalignment between unsupervised detector pretraining and downstream class-aware detection by introducing an end-to-end framework that uses high-level semantic proposals, clustering-based pseudo-labels, and self-training to iteratively refine supervision. It replaces low-level, auxiliary objectives with a principled alignment between pretext and downstream tasks, enabling joint pretraining of the backbone and detector and even training from scratch on scene-centric data. Across DETR-based and R-CNN architectures, AptDet achieves state-of-the-art results on standard benchmarks, demonstrates strong performance in low-data and few-shot regimes, and shows versatility for self-supervised backbone pretraining. The approach simplifies the training pipeline, reduces reliance on complex auxiliary losses, and offers practical avenues for unsupervised representation learning directly from detection as a pretext task.
Abstract
The unsupervised pretraining of object detectors has recently become a key component of object detector training, as it leads to improved performance and faster convergence during the supervised fine-tuning stage. Existing unsupervised pretraining methods, however, typically rely on low-level information to define the proposals that are used to train the detector. Furthermore, in the absence of class labels for these proposals, an auxiliary loss is used to add high-level semantics. This results in complex pipelines and a task gap between the pretraining and the downstream task. We propose a framework that mitigates this issue and consists of three simple yet key ingredients: (i) richer initial proposals that do encode high-level semantics, (ii) class pseudo-labeling through clustering, which enables pretraining using a standard object detection training pipeline, and (iii) self-training to iteratively improve and enrich the object proposals. Once the pretraining and downstream tasks are aligned, a simple detection pipeline without further bells and whistles can be directly used for pretraining and, in fact, results in state-of-the-art performance in both the full- and low-data regimes, across detector architectures and datasets, by significant margins. We further show that our pretraining strategy is also capable of pretraining from scratch (including the backbone) and works on complex images like COCO, paving the way for unsupervised representation learning using object detection directly as a pretext task.
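
To make ingredient (ii) concrete, the following is a minimal sketch of class pseudo-labeling through clustering. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which clustering algorithm, feature extractor, or number of clusters is used, so plain k-means over L2-normalised proposal embeddings is assumed here, and the function name `kmeans_pseudo_labels` and its parameters are hypothetical.

```python
import numpy as np


def kmeans_pseudo_labels(features, num_clusters, num_iters=20, seed=0):
    """Assign a class pseudo-label to each proposal via k-means (an assumption).

    features: (N, D) array with one embedding per object proposal
              (hypothetical input; the feature extractor is not specified).
    Returns (labels, centroids), with labels in [0, num_clusters).
    """
    rng = np.random.default_rng(seed)
    feats = np.asarray(features, dtype=np.float64)
    # L2-normalise so that Euclidean distance tracks cosine similarity.
    norms = np.linalg.norm(feats, axis=1, keepdims=True).clip(min=1e-12)
    feats = feats / norms
    # Initialise centroids from randomly chosen proposals (fancy indexing copies).
    init_idx = rng.choice(len(feats), size=num_clusters, replace=False)
    centroids = feats[init_idx]

    def assign(points, centers):
        # Squared distance from every proposal to every centroid: shape (N, K).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(feats, centroids)
        for k in range(num_clusters):
            members = feats[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    return assign(feats, centroids), centroids


# Toy usage: eight proposal embeddings forming two well-separated groups.
toy_features = np.array([
    [1.0, 0.0], [1.0, 0.05], [1.0, -0.05], [1.0, 0.1],
    [0.0, 1.0], [0.05, 1.0], [-0.05, 1.0], [0.1, 1.0],
])
toy_labels, toy_centroids = kmeans_pseudo_labels(toy_features, num_clusters=2)
```

In a pipeline of the kind the abstract describes, each proposal's cluster index would stand in for its class label, so a standard detection loss (classification plus box regression) can be applied without an auxiliary objective. Under self-training, the trained detector's own predictions would then replace or enrich the initial proposals for the next round.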
