LEARN: Learning End-to-End Aerial Resource-Constrained Multi-Robot Navigation

Darren Chiu, Zhehui Huang, Ruohai Ge, Gaurav S. Sukhatme

TL;DR

This work tackles onboard, resource-constrained multi-UAV navigation in cluttered environments by introducing LEARN, a lightweight two-stage safety-guided reinforcement learning framework. LEARN combines a minimal planning cue with an attention-based policy and a two-stage safety reward that leverages control barrier concepts, enabling fully onboard perception, planning, and control without external infrastructure. In simulation, LEARN outperforms two state-of-the-art planners by about 10% while using far fewer resources, and it scales to 6 real Crazyflie quads with speeds up to $2.0\ \mathrm{m/s}$ and through $0.2\ \mathrm{m}$ gaps, with zero-shot sim-to-real transfer demonstrated in diverse indoor/outdoor settings. The approach is highly bandwidth- and compute-efficient, requiring only local neighbor data and low-dimensional ToF obstacle sensing, and it supports robust operation under communication delays, making it practical for scalable nano-UAV swarms and real-world deployment.

Abstract

Nano-UAV teams offer great agility yet face severe navigation challenges due to constrained onboard sensing, communication, and computation. Existing approaches rely on high-resolution vision or compute-intensive planners, rendering them infeasible for these platforms. We introduce LEARN, a lightweight, two-stage safety-guided reinforcement learning (RL) framework for multi-UAV navigation in cluttered spaces. Our system combines low-resolution Time-of-Flight (ToF) sensors and a simple motion planner with a compact, attention-based RL policy. In simulation, LEARN outperforms two state-of-the-art planners by $10\%$ while using substantially fewer resources. We demonstrate LEARN's viability on six Crazyflie quadrotors, achieving fully onboard flight in diverse indoor and outdoor environments at speeds up to $2.0\ \mathrm{m/s}$ and traversing $0.2\ \mathrm{m}$ gaps.

Paper Structure

This paper contains 45 sections, 9 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: LEARN is a lightweight, two-stage safety-guided reinforcement learning framework for multi-UAV navigation in cluttered indoor and outdoor spaces. All processes, including perception, localization, communication, planning, and control, run entirely on an embedded single-core controller clocked at 168 MHz with 192 KB of RAM. A single policy is trained in simulation and duplicated across all quadrotors. During deployment, a naive minimum-snap planner produces goal points for the encoder. Each quadrotor obtains the positions and velocities of its two closest neighbors over radio, and senses obstacles with a low-dimensional time-of-flight sensor. The policy generates individual normalized rotor thrusts that are sent directly to the motors. LEARN transfers zero-shot to the real world with no fine-tuning. Experiments show that it scales up to 6 quadrotors in the real world and 24 in simulation.
  • Figure 2: Comparison of multi-robot navigation algorithms. We begin by reviewing studies that consider realistic obstacle perception and construct a comparison graph grounded in the algorithms examined across these works. A directed edge indicates that the source algorithm outperformed the destination algorithm in certain experiments reported in the cited paper. denotes control-based methods, and denotes planning-based methods. We omit SGBA mcguire_2019 from the comparison since its objective is limited to collision avoidance and exploration, without addressing goal-reaching.
  • Figure 3: Hardware System. The Crazyflie platform used in the real world (a) and in simulation. The quadrotor is equipped with a set of 4 VL53L5CX sensors that each provide an $8\times8$ depth image (b). Each quadrotor is $9.2\ \mathrm{cm}$ in size and weighs merely $47\ \mathrm{g}$. We utilize the onboard nRF51822 radio to communicate neighbor positions and velocities. (c) shows the individual components, where number and color denote the corresponding compute component, and (d) shows the breakdown of compute usage.
  • Figure 4: Method Overview. The training framework incorporates a two-stage, safety-based reward function using Safety Barrier Certificates wang2017safety. An asynchronous actor-critic architecture is used, where the critic observes a signed distance field (SDF) and employs a recurrent multi-headed attention architecture. At the beginning of each episode we generate a minimum-snap trajectory. The trajectory is evaluated at each controller step to generate a 13-dimensional goal point, which is subtracted from the current state. Green denotes components deployed on hardware; gray denotes components used only during training. The same policy is trained and deployed across all quadrotors.
  • Figure 5: Training Curves. PPO training curves for different model variants are shown above. The final policy, shown by the blue curve, is the one used for all experiments, both real-world and simulation. * is the final policy trained using Population Based Training jaderberg2017populationbasedtrainingneural. † denotes a version where the safety reward is applied at all steps (single stage). The larger policy employs a hidden size of $64$, as opposed to $32$ in the main experiments. The base policy from huang2024collision is also included as a baseline.
  • ...and 10 more figures
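
The Figure 1 caption states that each quadrotor uses the positions and velocities of its two closest neighbors. The sketch below illustrates one way that selection could look. It is not the authors' implementation: the function names, the relative (neighbor minus self) encoding, and the zero padding when fewer than two neighbors are in range are all assumptions made for illustration.

```python
import math


def nearest_neighbors(own_pos, neighbors, k=2):
    """Return the k (position, velocity) pairs closest to own_pos.

    `neighbors` is a list of (position, velocity) tuples, each a 3-tuple
    of floats. The helper name and data layout are hypothetical.
    """
    ranked = sorted(neighbors, key=lambda n: math.dist(own_pos, n[0]))
    return ranked[:k]


def neighbor_observation(own_pos, own_vel, neighbors, k=2):
    """Flatten the k nearest neighbors into a fixed-length vector of 6*k floats.

    Assumption: neighbors are encoded relative to the observing quadrotor,
    and missing neighbors are padded with zeros. The source does not
    specify either choice.
    """
    obs = []
    for pos, vel in nearest_neighbors(own_pos, neighbors, k):
        obs.extend(p - o for p, o in zip(pos, own_pos))  # relative position
        obs.extend(v - o for v, o in zip(vel, own_vel))  # relative velocity
    obs.extend([0.0] * (6 * k - len(obs)))  # pad if fewer than k neighbors
    return obs


if __name__ == "__main__":
    me_pos, me_vel = (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)
    others = [
        ((3.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
        ((1.0, 0.0, 0.0), (0.5, 0.0, 0.0)),
        ((2.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
    ]
    print(neighbor_observation(me_pos, me_vel, others))
```

Keeping the neighbor block at a fixed length means the policy input size does not depend on team size, which is consistent with the caption's claim that a single policy is duplicated across all quadrotors.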