Look Around and Learn: Self-Training Object Detection by Exploration
Gianluca Scarpellini, Stefano Rosa, Pietro Morerio, Lorenzo Natale, Alessio Del Bue
TL;DR
This work addresses the degradation of object detector performance in novel environments by proposing a fully self-supervised framework where an embodied agent actively explores to improve perception. The Look Around action module guides exploration toward uncertain instances using a 3D semantic voxel map and a disagreement-based reward, while Disagreement Reconciliation refines multi-view predictions into consistent pseudo-labels and soft targets for self-training. Across Gibson Habitat simulations and real-robot tests, the approach achieves state-of-the-art gains (e.g., $mAP_{@50}=46.60\%$ in simulation and $+9.97\%$ on a real robot) without ground-truth annotations, and is supported by a unified benchmarking framework and public code. The combination of an uncertainty-driven exploration policy with a contrastive instance-matching and soft-distillation training regime offers a practical path to robust, self-supervised adaptation of detectors in new environments.
Abstract
When an object detector is deployed in a novel setting, it often experiences a drop in performance. This paper studies how an embodied agent can automatically fine-tune a pre-existing object detector while exploring and acquiring images in a new environment without relying on human intervention, i.e., a fully self-supervised approach. In our setting, an agent initially learns to explore the environment using a pre-trained off-the-shelf detector to locate objects and associate pseudo-labels with them. By assuming that pseudo-labels for the same object must be consistent across different views, we learn the exploration policy Look Around to mine hard samples, and we devise a novel mechanism called Disagreement Reconciliation for producing refined pseudo-labels from the consensus among observations. We implement a unified benchmark of the current state-of-the-art and compare our approach with pre-existing exploration policies and perception mechanisms. Our method outperforms existing approaches, improving the object detector by 6.2% in a simulated scenario, a 3.59% advancement over other state-of-the-art methods, and by 9.97% in the real robotic test, without relying on ground-truth annotations. Code for the proposed approach and baselines is available at https://iit-pavis.github.io/Look_Around_And_Learn/.
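The multi-view consistency assumption can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that detections have already been associated with object instances (the paper does this through a 3D semantic voxel map and instance matching), and it replaces the learned reconciliation with a plain average of per-view class distributions. The function name `reconcile_views`, the input layout, and the disagreement measure (fraction of views whose top class differs from the consensus) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def reconcile_views(observations):
    """Fuse per-view class distributions into one pseudo-label per object.

    observations: list of (object_id, class_probs) pairs, one per view.
    Returns {object_id: (hard_label, soft_target, disagreement)}.
    NOTE: simplified stand-in, not the paper's Disagreement Reconciliation.
    """
    # Group the per-view predictions by the object instance they refer to.
    views_by_object = defaultdict(list)
    for object_id, class_probs in observations:
        views_by_object[object_id].append(class_probs)

    fused = {}
    for object_id, views in views_by_object.items():
        n_classes = len(views[0])
        # Soft target: mean of the per-view class distributions.
        soft_target = [
            sum(view[c] for view in views) / len(views) for c in range(n_classes)
        ]
        # Hard pseudo-label: most likely class under the averaged distribution.
        hard_label = max(range(n_classes), key=lambda c: soft_target[c])
        # Disagreement: share of views whose top class differs from the consensus.
        dissenting = sum(
            1
            for view in views
            if max(range(n_classes), key=lambda c: view[c]) != hard_label
        )
        fused[object_id] = (hard_label, soft_target, dissenting / len(views))
    return fused


if __name__ == "__main__":
    # Object 0 is seen from two views that disagree; object 1 from one view.
    example = [(0, [0.9, 0.1]), (0, [0.4, 0.6]), (1, [0.2, 0.8])]
    for object_id, (label, soft, disagreement) in reconcile_views(example).items():
        print(object_id, label, soft, disagreement)
```

In this toy setting, a high disagreement value marks an object whose views conflict, which is the kind of hard sample an exploration policy would want to revisit, while the averaged distribution plays the role of a soft target for self-training.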
