Dashing for the Golden Snitch: Multi-Drone Time-Optimal Motion Planning with Multi-Agent Reinforcement Learning

Xian Wang; Jin Zhou; Yuanli Feng; Jiahao Mei; Jiming Chen; Shuo Li

TL;DR

This work addresses time-optimal motion planning for multiple drones under collision-avoidance constraints by learning decentralized policies with multi-agent reinforcement learning. It introduces a centralized training, decentralized execution (CTDE) framework built on Independent PPO (IPPO) with a shared policy and a centralized critic, augmented by invalid-experience masking and value normalization, and it employs a soft collision-free mechanism with a safety tolerance. The method combines a DEC-POMDP formulation, a four-term reward, and a simplified quadrotor model to enable online, onboard inference. Extensive simulations and real-world flights demonstrate near-time-optimal performance with low collision rates: five quadrotors reach a peak speed of 27.1 m/s in simulation, and two quadrotors reach 13.65 m/s on real hardware. The results indicate strong potential for scalable, high-speed multi-drone operation in dynamic environments and suggest avenues for future work, such as temporal prediction, LiDAR-based sensing, and team-based coordination strategies.
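To make the idea of invalid-experience masking concrete, here is a minimal sketch in plain Python. It assumes that "invalid" samples are those recorded after an agent has already terminated (for example, after a crash); this summary does not define the criterion, so treat that as an assumption. The names `masked_mean`, `losses`, and `valid` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def masked_mean(values: Sequence[float], valid: Sequence[bool]) -> float:
    """Average `values` over the entries whose `valid` flag is True."""
    if len(values) != len(valid):
        raise ValueError("values and valid must have the same length")
    kept = [v for v, ok in zip(values, valid) if ok]
    # With no valid samples, return 0.0 so the batch contributes nothing.
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-sample losses drawn from a shared rollout buffer.
losses = [0.5, 1.5, 4.0, 2.0]
# Assumption: the third sample was recorded after its agent had terminated.
valid = [True, True, False, True]

loss = masked_mean(losses, valid)  # averages only the three valid samples
```

Averaging over the valid entries only, rather than zeroing invalid losses and dividing by the full batch size, keeps the loss scale independent of how many agents happened to terminate early.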

Abstract

Recent innovations in autonomous drones have facilitated time-optimal flight in single-drone configurations, and enhanced maneuverability in multi-drone systems by applying optimal control and learning-based methods. However, few studies have achieved time-optimal motion planning for multi-drone systems, particularly during highly agile maneuvers or in dynamic scenarios. This paper presents a decentralized policy network using multi-agent reinforcement learning for time-optimal multi-drone flight. To strike a balance between flight efficiency and collision avoidance, we introduce a soft collision-free mechanism inspired by optimization-based methods. By customizing PPO in a centralized training, decentralized execution (CTDE) fashion, we unlock higher efficiency and stability in training while ensuring lightweight implementation. Extensive simulations show that, despite slight performance trade-offs compared to single-drone systems, our multi-drone approach maintains near-time-optimal performance with a low collision rate. Real-world experiments validate our method, with two quadrotors using the same network as in simulation achieving a maximum speed of 13.65 m/s and a maximum body rate of 13.4 rad/s in a 5.5 m × 5.5 m × 2.0 m space across various tracks, relying entirely on onboard computation.
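One way to picture a soft collision-free mechanism is a penalty that stays at zero while two drones are farther apart than a safety distance and grows as they get closer. The sketch below is illustrative only: the linear hinge shape, the function name `soft_collision_penalty`, and the defaults `safe_dist=0.6` and `weight=1.0` are all assumptions and do not reproduce the paper's reward terms.

```python
import math
from typing import Sequence


def soft_collision_penalty(
    p_i: Sequence[float],
    p_j: Sequence[float],
    safe_dist: float = 0.6,  # assumed safety tolerance in meters
    weight: float = 1.0,     # assumed penalty scale
) -> float:
    """Return a non-positive reward term for two drone positions."""
    d = math.dist(p_i, p_j)
    if d >= safe_dist:
        return 0.0  # outside the tolerance: no penalty
    # Inside the tolerance: penalty grows linearly, reaching -weight at d = 0.
    return -weight * (safe_dist - d) / safe_dist


far = soft_collision_penalty((0.0, 0.0, 1.0), (1.0, 0.0, 1.0))
near = soft_collision_penalty((0.0, 0.0, 1.0), (0.3, 0.0, 1.0))
```

Because the penalty is graded rather than a hard constraint, a policy trained against a term like this can trade a small safety-margin violation for flight time, which is the balance between efficiency and collision avoidance that the abstract describes.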

Paper Structure (17 sections, 10 equations, 6 figures, 5 tables, 1 algorithm)

INTRODUCTION
METHODOLOGY
The multi-agent RL problem
Observations and actions
Rewards
Policy training
Multi-agent RL algorithm
Quadrotor dynamics
SIMULATION RESULTS AND ANALYSIS
Training details
Simulation Results
Comparative Simulation Experiments
Evaluation of a Five-Drone System
EXPERIMENT SETUP AND RESULT
Star Track Flight
...and 2 more sections

Figures (6)

Figure 1: Two quadrotors executing highly agile maneuvers, guided by onboard policy networks in a real-world flight.
Figure 2: Overview of the proposed method, illustrating the CTDE framework. During training, all drones share a common policy network $\pi_\theta$ and a value network $V_\phi$, with data collected in parallel and stored in a shared rollout buffer. For deployment, each drone executes the policy independently in a decentralized manner, using identical network parameters while making decisions based on its own local information.
Figure 3: Visualized quadrotor trajectories on different tracks for single, two, and three quadrotors. Each track has undergone thousands of tests with added noise to the waypoint positions. The plots show quadrotors completing three continuous laps, with shaded areas in the first row marking the waypoints within a radius of $d_{\text{w}}$ for a single lap. Despite a slight reduction in performance, the multi-drone approach achieves competitive results with low collision probability, proving effective across various scenarios.
Figure 4: Trajectories of five quadrotors on the 2019 AlphaPilot Challenge course using our decentralized policy, reaching a peak speed of 27.1 m/s, a collision rate of 5.9%, and a success rate of 83.8%.
Figure 5: Time-history of position coordinates during Star track experiments: the time-optimal CPC method (red dashed), the single-drone simulation (yellow dash-dotted), the single-drone real-world flight (orange dash-dotted), the two-drone real flights (solid). All data corresponds to first-lap performance under identical initial conditions.
...and 1 more figure
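The decentralized execution described in the Figure 2 caption, where every drone runs identical parameters on its own local observation, can be sketched as follows. The class `SharedLinearPolicy`, the two-dimensional observations, and the weight values are hypothetical placeholders; the paper's actual network architecture and observation layout are not given in this summary.

```python
from typing import Dict, List, Sequence


class SharedLinearPolicy:
    """Toy stand-in for the shared policy: action = W @ obs (no learning)."""

    def __init__(self, weights: List[List[float]]) -> None:
        self.weights = weights

    def act(self, obs: Sequence[float]) -> List[float]:
        # One output per weight row, computed from this drone's observation only.
        return [sum(w * o for w, o in zip(row, obs)) for row in self.weights]


# One set of parameters, shared by every drone (hypothetical values).
policy = SharedLinearPolicy([[1.0, 0.0], [0.0, -1.0]])

# Each drone sees only its own local observation (hypothetical values).
local_obs: Dict[str, List[float]] = {
    "drone_0": [0.5, 2.0],
    "drone_1": [-1.0, 0.25],
}

# Decentralized execution: no drone reads another drone's observation.
actions = {name: policy.act(obs) for name, obs in local_obs.items()}
```

Sharing parameters means adding a drone at deployment time requires no new network, only another call to the same `act` method with that drone's observation.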
