Multi-Agent Deep Reinforcement Learning for Collaborative UAV Relay Networks under Jamming Attacks
Thai Duong Nguyen, Ngoc-Tan Nguyen, Thanh-Dao Nguyen, Nguyen Van Huynh, Dinh-Hieu Tran, Symeon Chatzinotas
TL;DR
The paper addresses resilient UAV relay networking under jamming by formulating the problem as a cooperative DEC-POMDP and solving it with a Centralized Training with Decentralized Execution (CTDE) MARL framework. A centralized critic guides decentralized Q-networks, using a dual reward structure and prioritized experience replay to learn cooperative policies that maximize throughput while avoiding collisions. Results show a throughput gain of approximately 50% over heuristic baselines and a near-zero collision rate, with agents exhibiting an emergent anti-jamming strategy that balances interference mitigation against link maintenance. The work demonstrates that high-performance behaviors can emerge from multi-objective learning without explicit programming, which is of practical value for contested tactical networks.
Abstract
The deployment of Unmanned Aerial Vehicle (UAV) swarms as dynamic communication relays is critical for next-generation tactical networks. However, operating in contested environments requires balancing competing objectives: maximizing system throughput while avoiding collisions and remaining resilient to adversarial jamming. Existing heuristic-based approaches often struggle to find effective solutions because the problem is both dynamic and multi-objective. This paper formulates the challenge as a cooperative Multi-Agent Reinforcement Learning (MARL) problem and solves it using the Centralized Training with Decentralized Execution (CTDE) framework. Our approach employs a centralized critic that uses global state information to guide decentralized actors, which operate using only local observations. Simulation results show that the proposed framework significantly outperforms heuristic baselines, increasing total system throughput by approximately 50% while achieving a near-zero collision rate. A key finding is that the agents develop an emergent anti-jamming strategy without explicit programming: they learn to position themselves so as to balance mitigating interference from jammers against maintaining effective communication links with ground users.
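To make the CTDE information structure concrete, the following is a minimal sketch, not the authors' implementation. It illustrates only who sees what: each actor chooses an action from its own local observation, while the critic scores the global state together with the joint action. The agent count, observation size, action set, linear weights, and the choice to build the global state by concatenating observations are all assumptions made for illustration, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only for illustration.
N_AGENTS = 3      # number of UAV relays
OBS_DIM = 6       # size of one UAV's local observation
N_ACTIONS = 5     # discrete moves, e.g. stay, north, south, east, west
STATE_DIM = N_AGENTS * OBS_DIM  # assumption: global state = concatenated observations


class Actor:
    """Decentralized policy: sees only its own local observation."""

    def __init__(self):
        # A single linear layer stands in for a real policy network.
        self.w = rng.normal(scale=0.1, size=(OBS_DIM, N_ACTIONS))

    def act(self, local_obs):
        # Greedy action over linear scores.
        return int(np.argmax(local_obs @ self.w))


class CentralCritic:
    """Centralized value estimate: sees the global state and the joint action."""

    def __init__(self):
        in_dim = STATE_DIM + N_AGENTS * N_ACTIONS
        self.w = rng.normal(scale=0.1, size=in_dim)

    def value(self, global_state, joint_action):
        # One-hot encode each agent's action and append it to the global state.
        one_hot = np.eye(N_ACTIONS)[joint_action].ravel()
        return float(np.concatenate([global_state, one_hot]) @ self.w)


actors = [Actor() for _ in range(N_AGENTS)]
critic = CentralCritic()

# Execution: each actor uses only its own row of observations.
observations = rng.normal(size=(N_AGENTS, OBS_DIM))
joint_action = [actor.act(obs) for actor, obs in zip(actors, observations)]

# Training-time signal: the critic uses information no single actor has.
q_value = critic.value(observations.ravel(), joint_action)
```

In this sketch the critic is needed only during training, so at deployment each UAV would carry just its own actor and would require no access to the global state.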
