Approximate Imitation Learning for Event-based Quadrotor Flight in Cluttered Environments

Nico Messikommer, Jiaxu Xing, Leonard Bauersfeld, Marco Cannici, Elie Aljalbout, Davide Scaramuzza

TL;DR

This work proposes Approximate Imitation Learning, a novel imitation learning framework that outperforms standard imitation learning baselines in simulation and demonstrates robust performance in real-world flight tests, achieving speeds up to 9.8 m/s in cluttered environments.

Abstract

Event cameras offer high temporal resolution and low latency, making them ideal sensors for high-speed robotic applications where conventional cameras suffer from image degradations such as motion blur. In addition, their low power consumption can enhance endurance, which is critical for resource-constrained platforms. Motivated by these properties, we present a novel approach that enables a quadrotor to fly through cluttered environments at high speed by perceiving the environment with a single event camera. Our proposed method employs an end-to-end neural network trained to map event data directly to control commands, eliminating the reliance on standard cameras. To enable efficient training in simulation, where rendering synthetic event data is computationally expensive, we propose Approximate Imitation Learning, a novel imitation learning framework. Our approach leverages a large-scale offline dataset to learn a task-specific representation space. Subsequently, the policy is trained through online interactions that rely solely on lightweight, simulated state information, eliminating the need to render events during training. This enables the efficient training of event-based control policies for fast quadrotor flight, highlighting the potential of our framework for other modalities where data simulation is costly or impractical. Our approach outperforms standard imitation learning baselines in simulation and demonstrates robust performance in real-world flight tests, achieving speeds up to 9.8 m/s in cluttered environments.
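
The two training phases described above (and detailed in the Figure 2 caption) can be illustrated with a toy sketch. This is not the paper's implementation: every network is replaced by a single linear map, the "renderer" and "teacher" are random stand-ins, and all names (`run_sketch`, `W_s`, `W_hat`, `W_a`) are hypothetical. It also assumes that the alignment loss updates only the approximate student's encoder, which the caption does not state explicitly. What it does show is the data flow: events are needed only offline, while the online phase touches nothing but state inputs and the shared action decoder.

```python
import numpy as np

def mse(pred, target):
    """Mean over samples of the squared action error."""
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

def run_sketch(seed=0, n=64, d_state=4, d_event=8, d_feat=3, d_act=2,
               lr=0.01, offline_steps=500, online_steps=500):
    rng = np.random.default_rng(seed)
    # Toy stand-ins (assumptions): an "event renderer" and a privileged teacher.
    render = rng.normal(size=(d_state, d_event))
    teacher = rng.normal(size=(d_state, d_act))
    render_events = lambda s: np.tanh(s @ render)
    teacher_action = lambda s: np.tanh(s @ teacher)

    W_s = 0.1 * rng.normal(size=(d_event, d_feat))    # event encoder F_S
    W_hat = 0.1 * rng.normal(size=(d_state, d_feat))  # approximate-student encoder
    W_a = 0.1 * rng.normal(size=(d_feat, d_act))      # shared action decoder A

    # Offline phase: a fixed dataset with pre-rendered events.
    S = rng.normal(size=(n, d_state))
    X, A_T = render_events(S), teacher_action(S)
    offline_before = mse(X @ W_s @ W_a, A_T)
    for _ in range(offline_steps):
        H = X @ W_s                      # student features h_S
        A_S = H @ W_a                    # student actions a_S
        err = (A_S - A_T) * (2.0 / n)    # teacher supervises the event student
        g_a = H.T @ err
        g_s = X.T @ (err @ W_a.T)
        # Alignment: H and A_S are constants here (no gradient into the student).
        H_hat = S @ W_hat
        A_hat = H_hat @ W_a
        g_hat = S.T @ ((H_hat - H) + (A_hat - A_S) @ W_a.T) * (2.0 / n)
        W_s -= lr * g_s
        W_a -= lr * g_a
        W_hat -= lr * g_hat
    offline_after = mse(X @ W_s @ W_a, A_T)

    # Online phase: new states from a lightweight simulator, no event rendering.
    S_on = rng.normal(size=(n, d_state))
    A_on = teacher_action(S_on)
    W_s_snapshot = W_s.copy()
    online_before = mse(S_on @ W_hat @ W_a, A_on)
    for _ in range(online_steps):
        H_hat = S_on @ W_hat
        err = (H_hat @ W_a - A_on) * (2.0 / n)
        W_a -= lr * (H_hat.T @ err)      # only the shared decoder A is updated
    online_after = mse(S_on @ W_hat @ W_a, A_on)

    return {
        "offline_before": offline_before, "offline_after": offline_after,
        "online_before": online_before, "online_after": online_after,
        "event_encoder_unchanged_online": bool(np.array_equal(W_s, W_s_snapshot)),
    }

if __name__ == "__main__":
    for key, value in run_sketch().items():
        print(key, value)
```

Because the event student and the approximate student share the decoder, any online update to it also changes the event student's actions. Whether that change is an improvement depends on how well the two feature spaces were aligned offline, which this toy example does not measure.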

Paper Structure (25 sections, 3 equations, 13 figures, 4 tables, 1 algorithm)

This paper contains 25 sections, 3 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Overview of Our Proposed Framework. Our proposed framework leverages a large-scale offline dataset of rendered events alongside a real-world dataset covering diverse scenes. To fine-tune the event-based policy, we introduce an approximate student that receives efficiently simulated state information, entirely eliminating the need for computationally expensive event rendering. We validate our framework by flying a quadrotor equipped with a single event camera through cluttered real-world environments using external pose information.
  • Figure 2: Approximate Imitation Learning Overview. During offline training, teacher actions $a_T$ supervise the event student by updating its encoder $F_S$ and the shared action decoder $A$. The state-based approximate student, using teacher observations, is trained by aligning its features $\hat{h}_S$ and actions $\hat{a}_S$ with the student features $h_S$ and actions $a_S$, without backpropagating through the forward pass of the student. Distance maps generated from M3ED (Chaney et al., 2023) serve as targets for updating the event encoder $F_S$ and an auxiliary decoder $D$, adapting the event feature space $h_S$ to real-world events. During online training, teacher observations obtained from lightweight simulations are used to fine-tune the behavior of the approximate student by updating the shared action decoder $A$, implicitly improving the behavior of the event student. For clarity, the auxiliary decoder $D_o$ and the offline data update during online training are omitted.
  • Figure 3: Event Student Overview. The event representations are first encoded into features $h^{t_i}_S$ by the event student encoder $F_S$, which consists of a recurrent EfficientNet (Tan and Le, 2019) followed by the Projection Layers. These event features are then concatenated with the outputs of the Vector Layers, which encode auxiliary inputs, i.e., the direction command $\bar{v}_{t_i}$, the previous action $a^{t_{i-1}}_S$, and, optionally, the state information $s_{t_i}$. The combined features are fused in the Fusion Layer, and the final actions $a^{t_i}_S$ are produced by the Action Head.
  • Figure 4: Vectorized Event Representation Generation. Log intensities are quantized into bands, and band differences between adjacent timesteps are computed. Neighbouring non-zero values are then differenced again. Events are triggered where this second subtraction yields zero.
  • Figure 5: Runtime and Memory Requirements. The maximum and mean GPU memory usage and the required computation time are shown for our vectorized event generation (Vectorized) and ESIM’s GPU-based event generation (ESIM) across different numbers of environments. Our method leads to a 34% reduction in mean runtime (top) and significantly lowers peak GPU memory usage (bottom).
  • ...and 8 more figures
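
The first step in the Figure 4 caption, quantizing log intensities into bands and differencing the bands between adjacent timesteps, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's generator: the function names, the contrast threshold of 0.2, and the `eps` offset are hypothetical, and the second differencing step over neighbouring non-zero values is omitted. A non-zero band difference is read here as the number of threshold crossings at a pixel, with the sign giving the polarity.

```python
import numpy as np

def band_crossings(frames, contrast_threshold=0.2, eps=1e-3):
    """Signed number of log-intensity band crossings between consecutive frames.

    frames: array of shape (T, H, W) with non-negative intensities.
    Returns an integer array of shape (T - 1, H, W).
    """
    log_i = np.log(np.asarray(frames, dtype=np.float64) + eps)
    # Quantize log intensity into bands of width `contrast_threshold`.
    bands = np.floor(log_i / contrast_threshold).astype(np.int64)
    # Band difference between adjacent timesteps, for all pixels at once.
    return np.diff(bands, axis=0)

def polarity_and_count(crossings):
    """Split signed crossings into polarity (-1, 0, +1) and event count."""
    return np.sign(crossings), np.abs(crossings)

if __name__ == "__main__":
    frames = np.ones((3, 2, 2))
    frames[1:, 0, 0] = 2.0   # one pixel brightens after the first frame
    frames[2, 1, 1] = 0.5    # another pixel darkens in the last frame
    polarity, count = polarity_and_count(band_crossings(frames))
    print(polarity)
    print(count)
```

Everything is expressed as whole-array operations, so the same code applies unchanged to a batch of environments stacked along an extra axis, provided the `axis` argument of `np.diff` still points at the time dimension.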