Decentralized End-to-End Multi-AAV Pursuit Using Predictive Spatio-Temporal Observation via Deep Reinforcement Learning

Yude Li; Zhexuan Zhou; Huizhe Li; Yanke Sun; Yenan Wu; Yichen Lai; Yiming Wang; Youmin Gong; Jie Mei

Abstract

Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground-truth states, and therefore sidestep the perceptual uncertainty of real-world settings. We propose a decentralized end-to-end multi-agent reinforcement learning (MARL) framework that maps raw LiDAR observations directly to continuous control commands. Central to the framework is the Predictive Spatio-Temporal Observation (PSTO), an egocentric grid representation that aligns obstacle geometry with predictive adversarial intent and teammate motion in a unified, fixed-resolution projection. Built on PSTO, a single decentralized policy enables agents to navigate static obstacles, intercept dynamic targets, and maintain cooperative encirclement. Simulations demonstrate that the proposed method achieves superior capture efficiency and competitive success rates compared with state-of-the-art learning-based approaches that rely on privileged obstacle information. Furthermore, the unified policy scales across different team sizes without retraining. Finally, fully autonomous outdoor experiments validate the framework on a quadrotor swarm relying only on onboard sensing and computing.

Paper Structure (21 sections, 11 equations, 5 figures, 2 tables)

INTRODUCTION
RELATED WORK
METHODOLOGY
Problem Formulation as Dec-POMDP
Predictive Spatio-Temporal Observation Representation
Mapping 3D Body-Frame Points to 2D Grid Coordinates
Channel Generation
Policy Architecture and Training
Dual-Stream Convolutional Backbone
Actor-Critic Architecture
Reward Function
Progressive Training Curriculum
RESULTS AND DISCUSSION
Experimental Setup
Comparison with Traditional Methods
...and 6 more sections

Figures (5)

Figure 1: Validation in an unstructured outdoor environment. (a) Time-lapse of a fully autonomous 2-vs-1 pursuit. Pursuers (blue) execute a coordinated pincer maneuver to encircle the Evader (orange) among physical obstacles within a 9.0 m virtual boundary (blue ring). Dashed lines visualize the formation geometry at synchronized timestamps. (b) Global state visualization. Trajectories and perception data visualized in the global frame. Note the artificial point cloud wall superimposed to enforce arena constraints.
Figure 2: System overview of the proposed decentralized end-to-end pursuit framework. The architecture follows the CTDE paradigm. (Top) During decentralized execution, each pursuer generates the Predictive Spatio-Temporal Observation (PSTO), which is fused with proprioceptive data by the policy network to output continuous control commands. (Bottom) During centralized training, a critic network utilizes global state information to guide the policy optimization.
Figure 3: Visualization of the PSTO generation. (a) Egocentric View: The local environment observed by the Ego-Pursuer (center) in its body frame. It perceives the predicted Evader (pink), Teammates (blue), and static Obstacles (green) relative to itself. (b) & (c) PSTO Channels: The corresponding PSTO generated by the Ego-Pursuer. (b) shows the obstacle depth map projected from the green squares in (a), and (c) shows the intent heatmap combining attraction to the Evader and repulsion from Teammates.
Figure 4: Trajectory visualization ($1.6$ m/s, 8 obstacles). (a)--(c) Traditional Heuristics: Collision ('$\times$') or escape due to poor coordination. (d) OPEN (SOTA) chen2025online: Captures ('$\star$') but requires privileged obstacle states. (e)--(h) Ablations: (e)--(g) fail (speed deficit/containment loss); (h) captures ('$\star$') but takes a conservative detour. (i)--(k) Ours (PSTO) & Scalability: (i) Executes rapid capture ('$\star$'). (j)--(k) The unified policy scales to 3-vs-1 and 4-vs-1, achieving multi-angle containment. (Note: Dashed lines visualize synchronized formation geometry; baseline trajectories are continued after a collision to show trends.)
Figure 5: Fully autonomous system and outdoor validation. (a) The quadrotor equipped with Livox Mid-360 and NUC 13. (b) Swarm pursuit in an unstructured outdoor environment. (c) Quantitative results corresponding to the mission visualized in Fig. 1. The distance evolution reveals a pincer maneuver, culminating in successful encirclement.
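
The two-channel idea described in the Figure 3 caption (an obstacle depth map plus an intent heatmap that combines attraction to the predicted Evader with repulsion from Teammates) can be illustrated with a minimal sketch. This is not the authors' implementation: the top-down projection, the grid size, the Gaussian form of the heatmap, and every name below (`to_cell`, `cell_center`, `build_grid_sketch`, `half_extent`, `sigma`) are assumptions chosen for illustration only, and the paper's actual mapping from 3D body-frame points to grid coordinates may differ.

```python
import math


def to_cell(x, y, half_extent, grid_size):
    """Map a body-frame (x, y) position to a (row, col) cell, or None if outside."""
    cell = 2.0 * half_extent / grid_size
    col = math.floor((x + half_extent) / cell)
    row = math.floor((y + half_extent) / cell)
    if 0 <= row < grid_size and 0 <= col < grid_size:
        return row, col
    return None


def cell_center(row, col, half_extent, grid_size):
    """Return the body-frame (x, y) position of a cell center."""
    cell = 2.0 * half_extent / grid_size
    return (-half_extent + (col + 0.5) * cell,
            -half_extent + (row + 0.5) * cell)


def gaussian(dx, dy, sigma):
    """Unnormalized isotropic Gaussian bump, equal to 1.0 at zero offset."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def build_grid_sketch(obstacle_points, evader_pred, teammates,
                      grid_size=16, half_extent=8.0, sigma=1.5):
    """Build two egocentric grid channels (hypothetical layout, not the paper's).

    obstacle_points: iterable of body-frame (x, y, z) points, e.g. from LiDAR.
    evader_pred:     predicted evader position (x, y) in the body frame.
    teammates:       iterable of teammate positions (x, y) in the body frame.
    """
    # Channel 1 (assumed): nearest obstacle range per cell, normalized to [0, 1].
    # Cells with no return keep the value 1.0, meaning "free or unknown".
    max_range = half_extent * math.sqrt(2.0)
    depth = [[1.0] * grid_size for _ in range(grid_size)]
    for x, y, _z in obstacle_points:
        rc = to_cell(x, y, half_extent, grid_size)
        if rc is not None:
            r, c = rc
            depth[r][c] = min(depth[r][c], math.hypot(x, y) / max_range)

    # Channel 2 (assumed): attraction bump at the predicted evader minus
    # repulsion bumps at each teammate, clipped to [-1, 1].
    intent = [[0.0] * grid_size for _ in range(grid_size)]
    for r in range(grid_size):
        for c in range(grid_size):
            cx, cy = cell_center(r, c, half_extent, grid_size)
            value = gaussian(cx - evader_pred[0], cy - evader_pred[1], sigma)
            for tx, ty in teammates:
                value -= gaussian(cx - tx, cy - ty, sigma)
            intent[r][c] = max(-1.0, min(1.0, value))
    return depth, intent


# Example call with made-up inputs: one obstacle point, one evader, one teammate.
depth, intent = build_grid_sketch(
    obstacle_points=[(2.0, 0.0, 0.5)],
    evader_pred=(4.0, 4.0),
    teammates=[(-4.0, -4.0)],
)
```

Because both channels share one egocentric grid, a convolutional policy input of fixed shape is obtained regardless of how many obstacles or teammates are present, which is one plausible reading of how a single policy could be reused across team sizes.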
