Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based Agile Flight
Jiaxu Xing, Leonard Bauersfeld, Yunlong Song, Chunwei Xing, Davide Scaramuzza
TL;DR
This work tackles zero-shot scene transfer for vision-based mobile robotics by learning an environment-invariant, task-relevant vision embedding with adaptive multi-pair contrastive learning. It introduces intra- and inter-scene consistency objectives and an adaptive temperature tied to pose similarity, along with a privileged imitation-learning pipeline (DAgger) that trains a compact action network on vision embeddings and IMU data. The approach is validated in extensive simulation and real-world quadrotor racing experiments, where it shows higher embedding quality and better action learning than the baselines, including robust transfer to unseen environments. The findings suggest that the proposed contrastive training framework can generalize beyond drone racing to other vision-based sequential robotics tasks, offering a practical path toward zero-shot deployment. The work further demonstrates that combining task-focused representation learning with imitation-based control yields meaningful gains in sample efficiency and real-world robustness over traditional end-to-end or pretrained-world-model baselines.
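To make the idea of a multi-pair contrastive loss with a pose-dependent temperature more concrete, here is a minimal NumPy sketch. It is not the authors' loss, which this summary does not spell out. The Gaussian pose-similarity kernel, the threshold that selects positives, the linear temperature schedule (and its direction), and all names (`pose_similarity`, `adaptive_multi_pair_loss`, `length_scale`, `pos_threshold`, `tau_min`, `tau_max`) are assumptions made for illustration only.

```python
import numpy as np


def pose_similarity(poses, length_scale=1.0):
    """Hypothetical pose similarity in (0, 1]: Gaussian kernel on pairwise distance."""
    diff = poses[:, None, :] - poses[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return np.exp(-((dist / length_scale) ** 2))


def adaptive_multi_pair_loss(emb, poses, pos_threshold=0.5, tau_min=0.05, tau_max=0.5):
    """InfoNCE-style loss with several positives per anchor and a per-pair temperature.

    Assumptions (not taken from the paper):
      * samples whose pose similarity exceeds `pos_threshold` are positives,
        regardless of which scene they were rendered in;
      * the temperature shrinks linearly from `tau_max` to `tau_min` as the
        pose similarity grows.
    """
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm embeddings
    cos = z @ z.T                                         # pairwise cosine similarity
    sim = pose_similarity(poses)
    tau = tau_max - (tau_max - tau_min) * sim             # per-pair temperature

    n = emb.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    positives = (sim > pos_threshold) & not_self

    # Log-softmax over all other samples; the anchor itself is masked out.
    logits = np.where(not_self, cos / tau, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    n_pos = positives.sum(axis=1)
    valid = n_pos > 0                                     # anchors with at least one positive
    if not valid.any():
        return 0.0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float((-pos_log_prob[valid] / n_pos[valid]).mean())


# Toy data: the same four poses observed in two different scenes.
poses = np.array([[0, 0, 0], [3, 0, 0], [0, 3, 0], [0, 0, 3]], dtype=float)
poses = np.concatenate([poses, poses])
basis = np.eye(4)
aligned = np.concatenate([basis, basis])                        # same pose, same embedding
shuffled = np.concatenate([basis, np.roll(basis, 1, axis=0)])   # embeddings mismatched across scenes

print("aligned loss: ", adaptive_multi_pair_loss(aligned, poses))
print("shuffled loss:", adaptive_multi_pair_loss(shuffled, poses))
```

In this toy setup, embeddings that agree across scenes for the same pose receive a lower loss than embeddings that are mismatched across scenes, which is the qualitative behavior an environment-invariant embedding objective is meant to reward.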
Abstract
Scene transfer for vision-based mobile robotics applications is a highly relevant and challenging problem. The utility of a robot greatly depends on its ability to perform a task in the real world, outside of a well-controlled lab environment. Existing end-to-end policy learning approaches to scene transfer often suffer from poor sample efficiency or limited generalization capabilities, making them unsuitable for mobile robotics applications. This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment. Control policies relying on the embedding are able to operate in unseen environments without fine-tuning in the deployment environment. We evaluate our approach on the task of agile, vision-based quadrotor flight. Extensive simulation and real-world experiments demonstrate that our approach successfully generalizes beyond the training domain and outperforms all baselines.
