Self-Supervised Learning of Visual Robot Localization Using LED State Prediction as a Pretext Task
Mirko Nava, Nicholas Carlotti, Luca Crupi, Daniele Palossi, Alessandro Giusti
TL;DR
This work tackles the problem of visually localizing a target robot from a monocular camera when very few ground-truth position labels are available. It introduces a self-supervised pretext task that predicts the LED state (on/off) of the target robot, guiding the network to learn features relevant to localization. A fully convolutional network jointly learns a location map and an LED-state map, optimized with a combined loss ${\cal L}=(1-\lambda){\cal L}_{task}+\lambda{\cal L}_{pretext}$, where ${\cal L}_{pretext}$ is computed from the known LED states on a large set of images that carry no position labels. Experiments on nano-drones show that the pretext task improves localization accuracy from 68.3% to 76.2% and outperforms a supervised baseline and alternative pretraining strategies. In a field deployment, the model runs onboard at 21 fps and reduces the mean tracking error from 11.9 cm to 4.2 cm. The approach lowers labeling requirements and is practical for onboard, real-time vision-based tracking on constrained platforms.
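The weighting in the combined loss can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-sample loss forms (squared error on a 2D position, binary cross-entropy on the LED state), the dictionary keys, and the function names are all assumptions made for illustration. The sketch only shows how $\lambda$ trades off the two terms and how samples without position labels can still contribute through the pretext term.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a predicted probability p and a label y in {0, 1}.

    Assumed form of the pretext loss; the source does not specify it.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def squared_error(pred_uv, true_uv):
    """Squared distance between predicted and true 2D positions (assumed task loss)."""
    return sum((a - b) ** 2 for a, b in zip(pred_uv, true_uv))


def combined_loss(samples, lam):
    """Compute (1 - lam) * L_task + lam * L_pretext over a batch.

    Each sample is a dict (hypothetical layout) with:
      'led_prob'  : predicted probability that the LED is on
      'led_state' : known LED state, 0 or 1
      'pred_uv'   : predicted position
      'true_uv'   : position label, or None when no label exists
    """
    # The task term averages over position-labeled samples only.
    task_terms = [
        squared_error(s["pred_uv"], s["true_uv"])
        for s in samples
        if s.get("true_uv") is not None
    ]
    # The pretext term averages over every sample, since LED state is always known.
    pretext_terms = [bce(s["led_prob"], s["led_state"]) for s in samples]

    l_task = sum(task_terms) / len(task_terms) if task_terms else 0.0
    l_pretext = sum(pretext_terms) / len(pretext_terms) if pretext_terms else 0.0
    return (1.0 - lam) * l_task + lam * l_pretext


batch = [
    # Position-labeled sample
    {"led_prob": 0.5, "led_state": 1, "pred_uv": (0.5, 0.5), "true_uv": (0.5, 0.0)},
    # Sample with LED state only
    {"led_prob": 0.5, "led_state": 0, "pred_uv": (0.2, 0.3), "true_uv": None},
]
loss = combined_loss(batch, lam=0.5)
```

Setting `lam=0` recovers the purely supervised task loss, while `lam=1` trains on the pretext task alone; intermediate values mix the two.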
Abstract
We propose a novel self-supervised approach for learning to visually localize robots equipped with controllable LEDs. We rely on a few training samples labeled with position ground truth and many training samples in which only the LED state is known, which are cheap to collect. We show that using LED state prediction as a pretext task significantly helps to learn the visual localization end task. The resulting model does not require knowledge of LED states during inference. We instantiate the approach for visual relative localization of nano-quadrotors: experimental results show that using our pretext task significantly improves localization accuracy (from 68.3% to 76.2%) and outperforms alternative strategies, such as a supervised baseline, model pre-training, and an autoencoding pretext task. We deploy our model aboard a 27-g Crazyflie nano-drone, running at 21 fps, in a position-tracking task of a peer nano-drone. Our approach, relying on position labels for only 300 images, yields a mean tracking error of 4.2 cm, versus 11.9 cm for a supervised baseline model trained without our pretext task. Videos and code of the proposed approach are available at https://github.com/idsia-robotics/leds-as-pretext
