Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance
Anish Bhattacharya, Nishanth Rao, Dhruv Parikh, Pratik Kunapuli, Yuwei Wu, Yuezhan Tao, Nikolai Matni, Vijay Kumar
TL;DR
The paper tackles high-speed obstacle avoidance for quadrotors by replacing a modular perception-planning-control pipeline with an end-to-end vision-transformer-based policy trained on depth images from a privileged expert. It systematically compares ViT, ViT+LSTM, and several baselines (ConvNet, UNet, LSTM variants) in simulation and on real hardware, showing that attention-based models, especially ViT+LSTM, achieve lower collision rates and better generalization to unseen environments and narrow gaps. The work demonstrates strong zero-shot transfer to real-world flights up to 7 m/s and provides open-source resources to reproduce the results. Overall, it establishes vision transformers as a viable backbone for reactive, depth-based quadrotor control in cluttered environments.
Abstract
We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, and compare it with various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules break down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have shown great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity simulation. We observe that ViT models are more effective than the others as quadrotor speeds increase and when generalizing to unseen environments, and that adding recurrence further improves performance while reducing quadrotor energy cost across all tested flight speeds. We assess performance at speeds of up to 7 m/s in simulation and hardware. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.
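To make the "depth image-to-control" input side more concrete, the sketch below shows the generic ViT-style first step of cutting a depth image into patches that become the transformer's input tokens. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `patchify`, the non-overlapping tiling, the patch size, and the toy image are all hypothetical, and this section does not describe the tokenization the paper's models actually use.

```python
from typing import List

Image = List[List[float]]  # depth image as rows of depth values


def patchify(depth: Image, patch: int) -> List[List[float]]:
    """Split an H x W depth image into non-overlapping patch x patch tiles.

    Each tile is flattened row-major into one token vector. This is the
    generic ViT tokenization step, assumed here for illustration only.
    """
    height = len(depth)
    width = len(depth[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("empty image")
    if any(len(row) != width for row in depth):
        raise ValueError("all rows must have the same length")
    if patch <= 0 or height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")

    tokens: List[List[float]] = []
    for top in range(0, height, patch):        # walk tiles top to bottom
        for left in range(0, width, patch):    # then left to right
            tokens.append([
                depth[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens


# Hypothetical 4 x 6 depth image whose value encodes its (row, col) position.
toy = [[float(10 * r + c) for c in range(6)] for r in range(4)]
tokens = patchify(toy, patch=2)  # 6 tokens, each with 4 depth values
```

In a full model, each token would typically be linearly projected and passed through attention layers before a head regresses the control command; in a recurrent variant such as ViT+LSTM, a recurrent layer would carry information across consecutive frames. Those stages are omitted here because their details are not given in this section.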
