Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance

Anish Bhattacharya, Nishanth Rao, Dhruv Parikh, Pratik Kunapuli, Yuwei Wu, Yuezhan Tao, Nikolai Matni, Vijay Kumar

TL;DR

The paper tackles high-speed obstacle avoidance for quadrotors by replacing modular perception-planning-control with an end-to-end vision-transformer-based policy trained from a privileged expert using depth images. It systematically compares ViT, ViT+LSTM, and several baselines (ConvNet, UNet, LSTM variants) in simulation and real hardware, showing that attention-based models, especially ViT+LSTM, achieve lower collision rates and better generalization to unseen environments and narrow gaps. The work demonstrates strong zero-shot transfer to real-world flights up to 7 m/s and provides open-source resources to reproduce the results. Overall, it establishes vision transformers as a viable backbone for reactive, depth-based quadrotor control in cluttered environments.

Abstract

We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules break down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have been shown to have great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity simulation, observing that ViT models are more effective than others as quadrotor speeds increase and in generalization to unseen environments, while the addition of recurrence further improves performance and reduces quadrotor energy cost across all tested flight speeds. We assess performance at speeds of up to 7 m/s in simulation and hardware. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.

Paper Structure (16 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Zero-shot sim-to-real transfer for high-speed and multi-obstacle, 3D evasive maneuver with a combination vision transformer-recurrent model.
  • Figure 2: Depth images $\mathbf{im}_{depth}$ and quadrotor orientation $\mathbf{q}_{att}$ come from Flightmare and, along with forward velocity $v_{fwd}$, serve as input to the chosen learning model. The model outputs a linear velocity command $\mathbf{v}_{pred}$, which is used to form a min-snap trajectory tracked by a geometric controller, both part of the Dodgelib control stack (dodgedrone-competition). This stack outputs rotor speeds, which are sent to the quadrotor simulator in Flightmare.
  • Figure 3: Expert policy visualized from the quadrotor onboard camera in a sample training environment. The waypoint panel shows waypoints collision-queried at a fixed horizon given privileged obstacle location information (red: in collision, blue: free); green marks the waypoint chosen as the target of the velocity command (indicated with an arrow). This velocity command and the corresponding depth image (depth panel), as well as quadrotor attitude and velocity, are the data collected at this timestamp.
  • Figure 4: In the Spheres collision-rate panel, collision rates in a previously unseen Spheres environment are lower for ViT+LSTM (green) than for the expert and other models beyond 5 m/s. The Trees collision-rate panel shows that in a novel-obstacle Trees environment, the ViT-based models (green, grey) generalize and perform substantially better than other models. The trajectory visualization panel qualitatively depicts the path distribution for each model in a fixed scene, where ViT+LSTM (green) appears to consistently take a more direct path through the cluttered environment (note non-equal axis aspect ratios). The Spheres energy panel presents the estimated energy cost, where ViT+LSTM is better than either component used alone.
  • Figure 5: Attention map visualizations for each model's characteristic operation (sequential convolution, UNet-style convolution, or attention). Depth images are used as model input, but the highlights (light/dark or green/purple, indicating high/low attention) are overlaid on color or depth images for simulated or real experiments, respectively.
  • ...and 2 more figures
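
The privileged expert described in the Figure 3 caption (candidate waypoints collision-queried at a fixed horizon, one free waypoint chosen, a velocity command issued towards it) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes spherical obstacles, a square grid of candidates in the plane ahead of the vehicle, and a "closest to straight ahead" selection rule; none of these details are given in the captions above. All names and default values (`expert_velocity`, `horizon`, `half_width`, `margin`, and so on) are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Sphere = Tuple[Vec3, float]  # (center, radius); assumed obstacle model


def in_collision(point: Vec3, obstacles: Sequence[Sphere], margin: float) -> bool:
    """True if the point lies within any sphere inflated by a safety margin."""
    return any(math.dist(point, center) <= radius + margin
               for center, radius in obstacles)


def candidate_waypoints(position: Vec3, horizon: float,
                        half_width: float, n_per_axis: int) -> List[Vec3]:
    """Square grid of waypoints in the y-z plane, `horizon` metres ahead in x."""
    px, py, pz = position
    if n_per_axis == 1:
        offsets = [0.0]
    else:
        step = 2.0 * half_width / (n_per_axis - 1)
        offsets = [-half_width + i * step for i in range(n_per_axis)]
    return [(px + horizon, py + dy, pz + dz) for dy in offsets for dz in offsets]


def expert_velocity(position: Vec3, obstacles: Sequence[Sphere], speed: float,
                    horizon: float = 5.0, half_width: float = 2.0,
                    n_per_axis: int = 5, margin: float = 0.3) -> Optional[Vec3]:
    """Velocity command towards a collision-free waypoint, or None if all collide.

    Assumed selection rule: pick the free waypoint nearest the straight-ahead point.
    """
    waypoints = candidate_waypoints(position, horizon, half_width, n_per_axis)
    free = [w for w in waypoints if not in_collision(w, obstacles, margin)]
    if not free:
        return None
    px, py, pz = position
    ahead = (px + horizon, py, pz)
    chosen = min(free, key=lambda w: math.dist(w, ahead))
    direction = (chosen[0] - px, chosen[1] - py, chosen[2] - pz)
    norm = math.hypot(*direction)
    # Scale the unit direction to the desired flight speed.
    return (speed * direction[0] / norm,
            speed * direction[1] / norm,
            speed * direction[2] / norm)
```

In a data-collection loop of the kind the caption describes, each call to such a function would be paired with the depth image, attitude, and velocity at the same timestamp to form one training sample. Only the waypoints themselves are collision-checked here, not the path towards them, which is a simplification.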