Look Around and Learn: Self-Training Object Detection by Exploration

Gianluca Scarpellini, Stefano Rosa, Pietro Morerio, Lorenzo Natale, Alessio Del Bue

TL;DR

This work addresses the degradation of object detector performance in novel environments by proposing a fully self-supervised framework where an embodied agent actively explores to improve perception. The Look Around action module guides exploration toward uncertain instances using a 3D semantic voxel map and a disagreement-based reward, while Disagreement Reconciliation refines multi-view predictions into consistent pseudo-labels and soft targets for self-training. Across Gibson Habitat simulations and real-robot tests, the approach achieves state-of-the-art gains (e.g., $mAP_{@50}=46.60\%$ in simulation and $+9.97\%$ on a real robot) without ground-truth annotations, and is supported by a unified benchmarking framework and public code. The combination of an uncertainty-driven exploration policy with a contrastive instance-matching and soft-distillation training regime offers a practical path to robust, self-supervised adaptation of detectors in new environments.

Abstract

When an object detector is deployed in a novel setting, it often experiences a drop in performance. This paper studies how an embodied agent can automatically fine-tune a pre-existing object detector while exploring and acquiring images in a new environment without relying on human intervention, i.e., a fully self-supervised approach. In our setting, an agent initially learns to explore the environment using a pre-trained off-the-shelf detector to locate objects and associate pseudo-labels. By assuming that pseudo-labels for the same object must be consistent across different views, we learn the exploration policy Look Around to mine hard samples, and we devise a novel mechanism called Disagreement Reconciliation for producing refined pseudo-labels from the consensus among observations. We implement a unified benchmark of the current state-of-the-art and compare our approach with pre-existing exploration policies and perception mechanisms. Our method is shown to outperform existing approaches, improving the object detector by 6.2% in a simulated scenario, a 3.59% advancement over other state-of-the-art methods, and by 9.97% in the real robotic test without relying on ground-truth annotations. Code for the proposed approach and baselines is available at https://iit-pavis.github.io/Look_Around_And_Learn/.

Paper Structure (30 sections, 4 equations, 8 figures, 3 tables)

This paper contains 30 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: We equip an agent with an off-the-shelf object detector. The agent explores a new environment and collects a set of noisy detections $D_{0:N}$ along a trajectory $S_{0:N}$, obtained by following the intermediate goals $a_{0:N}$ predicted by the exploration policy $\pi$. These detections are then used to fine-tune the detector.
  • Figure 2: Our approach consists of an action phase and a perception phase. Action: (a) our policy predicts a long-term goal for the agent. (b) During exploration, the agent builds a semantically consistent voxel map of the environment by projecting the predictions of its object detector into three-dimensional space. (c) We project the map onto a top-down view and compute a disagreement map by assigning a disagreement score to each cell. (d) The disagreement map is the input to our policy network. Perception: (e) we collect samples by exploring an environment with the learned policy and (f) project the semantic voxel map onto each observation to build the pseudo-labels for the self-training scheme. (g) Finally, we fine-tune the object detector using only the pseudo-labels.
  • Figure 3: Our policy explores the environment by maximizing the disagreement between predictions for the same object. The instance-matching loss exploits this behavior. It pulls feature vectors belonging to the same object ($\text{u}$ in the figure) close together in the feature space while pushing apart feature vectors of different objects ($\text{u}$ and $\text{v}$).
  • Figure 4: (a) Robot and sensor placement; (b) RGB image with detections superimposed; (c) map of the environment with disagreement superimposed in green.
  • Figure 5: Semantic voxel map creation and projection of detections onto 2D frames. First, we aggregate detections $D_0, \dots, D_N$ into a semantic voxel map. We resolve inconsistencies in the voxel map by assigning to each voxel the class with the maximum score among the predictions for that voxel. Next, we project the semantic voxel map back onto each observation, obtaining $\overline D_0, \dots, \overline D_N$. $\overline D_N$ is the consistent pseudo-label for observation $N$ and is obtained by reprojecting the voxel map onto RGB-D frame $x_N$. Each pseudo-label $\overline D_i$ is associated with an object instance via the identifier $u_i$ and contains the consistent logits vector $\overline \lambda_{u_i}$.
  • ...and 3 more figures
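
The captions for Figures 2 and 5 describe two aggregation steps: resolving each voxel to the class with the maximum score among its predictions, and scoring disagreement per cell of a top-down map. The following is a minimal sketch of both ideas, not the authors' implementation. The function names, the tuple-based voxel indexing, and the input layout are hypothetical. The disagreement score used here (number of distinct predicted classes in a cell, minus one) is an assumption chosen for illustration, since the captions do not define the score.

```python
from collections import defaultdict


def reconcile_voxels(predictions):
    """Pick, for every voxel, the class with the highest score over all views.

    predictions: list of (voxel, scores) pairs, where voxel is an (x, y, z)
    tuple and scores is a list of per-class confidences from one view.
    Returns {voxel: (class_id, score)}.
    """
    consensus = {}
    for voxel, scores in predictions:
        # Best class proposed by this single view.
        class_id = max(range(len(scores)), key=lambda c: scores[c])
        score = scores[class_id]
        # Keep it only if it beats every earlier prediction for this voxel.
        if voxel not in consensus or score > consensus[voxel][1]:
            consensus[voxel] = (class_id, score)
    return consensus


def disagreement_map(predictions):
    """Top-down disagreement per (x, y) cell.

    Assumption: the score is the number of distinct predicted classes in the
    cell minus one, an illustrative stand-in for the paper's definition.
    """
    classes_per_cell = defaultdict(set)
    for (x, y, _z), scores in predictions:
        top_class = max(range(len(scores)), key=lambda c: scores[c])
        classes_per_cell[(x, y)].add(top_class)
    return {cell: len(classes) - 1 for cell, classes in classes_per_cell.items()}


# Three views of one voxel (two favor class 0, one favors class 1) and one
# voxel seen once. Class indices and scores are made up for the example.
views = [
    ((2, 3, 1), [0.55, 0.40]),
    ((2, 3, 1), [0.30, 0.65]),
    ((2, 3, 1), [0.90, 0.05]),
    ((5, 1, 0), [0.10, 0.80]),
]
labels = reconcile_voxels(views)
heat = disagreement_map(views)
```

In this toy input, the contested voxel resolves to class 0 because the 0.90 prediction outscores the conflicting 0.65 one, and its top-down cell is the only one with a nonzero disagreement score. A policy driven by such a map would be drawn toward that cell.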