Self-Supervised Learning of Visual Robot Localization Using LED State Prediction as a Pretext Task

Mirko Nava; Nicholas Carlotti; Luca Crupi; Daniele Palossi; Alessandro Giusti

TL;DR

This work tackles the problem of visually localizing a target robot from a monocular camera when very few ground-truth position labels are available. It introduces a self-supervised pretext task that predicts the state (on/off) of the target drone's LEDs, guiding the network to learn features relevant to localization. A fully convolutional network jointly learns a location map and an LED-state map, optimized with a combined loss ${\cal L}=\lambda{\cal L}_\text{task}+(1-\lambda){\cal L}_\text{pretext}$, where ${\cal L}_\text{task}$ is defined on the small set of images with position labels and ${\cal L}_\text{pretext}$ uses the known LED states as labels on a much larger set of images, most of which have no position labels. Experiments on nano-drones show that the pretext task improves localization over a supervised baseline and over alternative strategies (model pre-training and an autoencoding pretext task). Deployed onboard a nano-drone at 21 fps, the model reduces the mean tracking error from 11.9 cm to 4.2 cm. The approach reduces labeling requirements and is practical for onboard, real-time vision-based tracking on resource-constrained platforms.

Abstract

We propose a novel self-supervised approach for learning to visually localize robots equipped with controllable LEDs. We rely on a few training samples labeled with position ground truth and many training samples in which only the LED state is known, whose collection is cheap. We show that using LED state prediction as a pretext task significantly helps to learn the visual localization end task. The resulting model does not require knowledge of LED states during inference. We instantiate the approach to visual relative localization of nano-quadrotors: experimental results show that using our pretext task significantly improves localization accuracy (from 68.3% to 76.2%) and outperforms alternative strategies, such as a supervised baseline, model pre-training, and an autoencoding pretext task. We deploy our model aboard a 27-g Crazyflie nano-drone, running at 21 fps, in a position-tracking task of a peer nano-drone. Our approach, relying on position labels for only 300 images, yields a mean tracking error of 4.2 cm versus 11.9 cm of a supervised baseline model trained without our pretext task. Videos and code of the proposed approach are available at https://github.com/idsia-robotics/leds-as-pretext
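The training setup described above, with few samples carrying a position label and many carrying only a known LED state, can be pictured with a minimal sketch. This is not the authors' implementation: the function name `combined_loss`, the use of `None` to mark a missing position label, and the per-sample loss values are all hypothetical. The weighting convention (λ on the task term, so that λ = 1 reduces to a purely supervised baseline) is an assumption taken from the Figure 5 caption.

```python
from typing import Optional, Sequence


def combined_loss(
    task_losses: Sequence[Optional[float]],
    pretext_losses: Sequence[float],
    lam: float,
) -> float:
    """Blend a task loss and a pretext loss over one batch (illustrative).

    task_losses[i] is None when sample i has no position label.
    pretext_losses[i] is always present because the LED state is known.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(task_losses) != len(pretext_losses) or not pretext_losses:
        raise ValueError("expected two non-empty sequences of equal length")

    # Task term: average only over the samples that carry a position label.
    labeled = [t for t in task_losses if t is not None]
    l_task = sum(labeled) / len(labeled) if labeled else 0.0

    # Pretext term: average over every sample, labeled or not.
    l_pretext = sum(pretext_losses) / len(pretext_losses)

    # Assumed convention: lam weights the task term, (1 - lam) the pretext term.
    return lam * l_task + (1.0 - lam) * l_pretext
```

In this sketch a batch made up entirely of unlabeled samples still contributes a gradient signal through the pretext term, which is the point of collecting cheap LED-state data.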

Paper Structure (19 sections, 2 equations, 7 figures, 1 table)

Introduction
Related Work
Relative Visual Localization of Drones
Self-Supervised Relative Drone Localization
LED State Prediction as a Pretext Task
Experimental Setup
Robot Platform
Datasets
Alternative Strategies
Network Architectures and Training
From Grid Map to Robot Position
Evaluation Metrics
Results and Discussion
LED State Prediction Improves Performance
Impact of Lambda and Amount of Labeled Examples
...and 4 more sections

Figures (7)

Figure 1: A fully convolutional network model is trained to predict the drone position in the current frame by minimizing a loss ${\cal L}_\text{task}$ defined on a small labeled dataset ${\cal T}_l$ (bottom), and the state of the four drone LEDs, by minimizing ${\cal L}_\text{pretext}$ defined on a large dataset ${\cal T}_l \cup {\cal T}_u$ (top).
Figure 2: The palm-sized Bitcraze Crazyflie 2.1 nano-drone platform (10 cm in diameter). (a) The drone's hardware and its four controllable LEDs; (b, c) high-resolution pictures of the flying drone; (d, f) samples from our dataset; (e, g) zoom-in on the drone using the model's receptive field ($45 \times 45$ pixels).
Figure 3: ledp model predictions on the test set $\mathcal{Q}$ with argmax and barycenter approaches for the $u$ and $v$ components of the drone's position.
Figure 4: ledp-30 with $\lambda = 0.001$ (small green circle), bas-30 (yellow cross) predictions and ground truth (large magenta circle) on frames taken from $\mathcal{Q}$ with the drone's LEDs turned on (first three) and off (last three), and featuring different camera exposure settings.
Figure 5: $\text{P}^{+}_{30}$ score for bas $(\lambda = 1)$ and ledp $(\lambda < 1)$ strategies as the amount of labeled training examples $\mathcal{T}_\ell$ and the weight of the loss $\lambda$ vary.
...and 2 more figures
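The Figure 3 caption contrasts argmax and barycenter approaches for turning a predicted map into the $u$ and $v$ components of the drone's position. The following is a minimal sketch of the two ideas, assuming the map is a small grid of non-negative scores indexed as `grid[v][u]`; the function names and the grid layout are hypothetical, and the paper's exact procedure may differ.

```python
from typing import List, Tuple

Grid = List[List[float]]


def argmax_position(grid: Grid) -> Tuple[int, int]:
    """Return (u, v) = (column, row) of the first cell holding the maximum."""
    best_u, best_v = 0, 0
    for v, row in enumerate(grid):
        for u, value in enumerate(row):
            if value > grid[best_v][best_u]:
                best_u, best_v = u, v
    return best_u, best_v


def barycenter_position(grid: Grid) -> Tuple[float, float]:
    """Return the (u, v) centroid, weighting each cell by its score."""
    total = sum(sum(row) for row in grid)
    if total <= 0:
        raise ValueError("grid must contain positive mass")
    u = sum(value * u_idx for row in grid for u_idx, value in enumerate(row)) / total
    v = sum(value * v_idx for v_idx, row in enumerate(grid) for value in row) / total
    return u, v


# Toy map: two equally strong neighboring cells in the middle row.
toy_grid = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
]
print(argmax_position(toy_grid))      # (1, 1)
print(barycenter_position(toy_grid))  # (1.5, 1.0)
```

On this toy map the argmax snaps to a single cell, while the barycenter lands between the two active cells, which illustrates why a weighted average can give sub-cell resolution.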
