Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based Agile Flight

Jiaxu Xing; Leonard Bauersfeld; Yunlong Song; Chunwei Xing; Davide Scaramuzza

TL;DR

This work tackles zero-shot scene transfer for vision-based mobile robotics by learning an environment-invariant, task-relevant vision embedding using adaptive multi-pair contrastive learning. It introduces intra- and inter-scene consistency objectives and an adaptive temperature tied to pose similarity, along with a privileged imitation-learning pipeline (DAgger) to train a compact action net that relies on vision embeddings and IMU data. The approach is validated through extensive simulation and real-world quadrotor racing, showing superior embedding quality and improved action learning, including robust transfer to unseen environments. The findings suggest that the proposed contrastive training framework can generalize beyond drone racing to other vision-based sequential robotics tasks, offering a practical path toward zero-shot deployment. The work further demonstrates that combining task-focused representation learning with imitation-based control yields meaningful gains in sample efficiency and real-world robustness over traditional end-to-end or pretrained-world-model baselines.

Abstract

Scene transfer for vision-based mobile robotics applications is a highly relevant and challenging problem. The utility of a robot greatly depends on its ability to perform a task in the real world, outside of a well-controlled lab environment. Existing end-to-end policy learning approaches to scene transfer often suffer from poor sample efficiency or limited generalization capabilities, making them unsuitable for mobile robotics applications. This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment. Control policies that rely on the embedding can operate in unseen environments without finetuning in the deployment environment. We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight. Extensive simulation and real-world experiments show that our approach successfully generalizes beyond the training domain and outperforms all baselines.
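The idea of a multi-pair contrastive objective with a pose-dependent temperature can be illustrated with a minimal NumPy sketch. This is not the paper's loss: it is a generic InfoNCE-style objective averaged over several positives, and the function names, the exponential temperature schedule, and the values of `tau_min`, `tau_max`, and `scale` are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def adaptive_temperature(pose_distance, tau_min=0.05, tau_max=0.5, scale=1.0):
    """Hypothetical schedule (an assumption, not taken from the paper):
    the temperature rises from tau_min towards tau_max as the pose
    distance between anchor and positive grows."""
    d = np.asarray(pose_distance, dtype=float)
    return tau_min + (tau_max - tau_min) * (1.0 - np.exp(-d / scale))


def multi_pair_info_nce(anchor, positives, negatives, pos_pose_dist):
    """InfoNCE-style loss averaged over P positives.

    anchor: (D,), positives: (P, D), negatives: (N, D),
    pos_pose_dist: (P,) pose distance of each positive to the anchor.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positives)
    n = l2_normalize(negatives)
    tau = adaptive_temperature(pos_pose_dist)            # (P,)
    pos_logits = (p @ a) / tau                           # (P,)
    neg_logits = (n @ a)[None, :] / tau[:, None]         # (P, N)
    logits = np.concatenate([pos_logits[:, None], neg_logits], axis=1)
    # Numerically stable log-softmax; the positive sits in column 0.
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_z - logits[:, 0]))


# Toy usage: positives aligned with the anchor give a lower loss
# than positives that point away from it.
anchor = np.array([1.0, 0.0])
near = np.array([[1.0, 0.0], [0.9, 0.1]])
far = np.array([[-1.0, 0.0], [0.0, 1.0]])
dist = np.array([0.0, 0.5])
loss_good = multi_pair_info_nce(anchor, near, far, dist)
loss_bad = multi_pair_info_nce(anchor, far, near, dist)
```

In this sketch each positive gets its own temperature, so pairs with closer poses are contrasted more sharply against the shared negatives; other ways of coupling temperature to pose similarity are equally plausible.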

Paper Structure (23 sections, 6 equations, 4 figures, 2 tables)

Introduction
Related Work
End-to-end policy learning
Visual pre-training for robotics
Methodology
Adaptive Contrastive Learning
Intra-scene consistency
Inter-scene consistency
Action Net Learning
Teacher policy learning
Student policy learning
Experiments
Implementation Details and Model Training
Image data for vision encoder training
Vision encoder training
...and 8 more sections

Figures (4)

Figure 1: We train a vision encoder using our proposed adaptive contrastive learning strategy. Positive examples are sampled from different environments and from nearby points, while negative examples come from far-away track segments, as shown in (a). The action network controlling the robot then has access to a history of vision embeddings as well as IMU measurements. The action net predicts the control commands for the mobile robot, e.g. thrust and body rate commands for a quadrotor, as shown in (b). Our adaptive contrastive learning embeds the images into a self-consistent and scene-invariant feature space, shown via a t-SNE (van der Maaten and Hinton, 2008) visualization in (c).
Figure 2: This figure shows the intra-consistency of the different methods. Solid lines represent methods that have been finetuned and dash-dotted lines represent pretrained methods. One can clearly see that pretrained methods produce much more similar embeddings, even at opposite ends of the track (50%). Our proposed method produces an embedding that is very distinctive, as points far apart on the track have very dissimilar embeddings.
Figure 3: This figure shows the inter-consistency of the different methods. This is a measure of how close together the embeddings from different environments are (higher is better). Our method clearly outperforms the baselines, as a high inter-consistency is desired while ensuring a low intra-consistency for far-apart samples. The numbers below each bar indicate the mean embedding similarity $\mu$ and the inter-quartile range (IQR).
Figure 4: t-SNE visualizations of the embeddings for different environments. The illustration graphically summarizes what can be read quantitatively from Fig. 2 and Fig. 3: only our approach produces distinctive, task-related embeddings that are similar across all environments.
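The two quantities plotted in Fig. 2 and Fig. 3 can be approximated with a short NumPy sketch. The captions do not give the exact definitions, so this assumes cosine similarity as the similarity measure, embeddings sampled at matched track positions in every environment, and a closed track (so offsets wrap around); the function names are hypothetical.

```python
import numpy as np


def unit(x):
    """Normalize along the last axis (assumes non-zero embeddings)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def inter_consistency(emb):
    """emb: (E, K, D) embeddings of E environments at K matched track positions.

    Returns the mean and inter-quartile range of the cosine similarity
    between embeddings of the same track position in different environments.
    """
    z = unit(emb)
    num_envs = z.shape[0]
    sims = np.concatenate([
        np.sum(z[i] * z[j], axis=-1)
        for i in range(num_envs)
        for j in range(i + 1, num_envs)
    ])
    q1, q3 = np.percentile(sims, [25, 75])
    return float(sims.mean()), float(q3 - q1)


def intra_consistency(emb_one_env):
    """emb_one_env: (K, D) embeddings along one closed track.

    Returns the mean cosine similarity for each cyclic offset 0..K//2,
    where offset K//2 corresponds to opposite ends of the track (50%).
    """
    z = unit(emb_one_env)
    num_pos = z.shape[0]
    return np.array([
        np.mean(np.sum(z * np.roll(z, -s, axis=0), axis=-1))
        for s in range(num_pos // 2 + 1)
    ])


# Toy usage: an idealized embedding that encodes track progress as an angle
# and is identical in every environment.
K = 8
theta = 2.0 * np.pi * np.arange(K) / K
track = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (K, 2)
emb = np.stack([track, track, track], axis=0)              # (3, K, 2)
mu, iqr = inter_consistency(emb)
curve = intra_consistency(track)
```

For this idealized embedding the inter-consistency is maximal with zero spread, and the intra-consistency curve falls steadily as the offset approaches the opposite side of the track, which is the qualitative pattern the captions describe as desirable.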
