Multi-Agent Meta-Offline Reinforcement Learning for Timely UAV Path Planning and Data Collection
Eslam Eldeeb, Hirley Alves
TL;DR
This work addresses timely UAV path planning and data collection under dynamic network configurations by developing an offline multi-agent reinforcement learning framework that combines Conservative Q-Learning (CQL) with Model-Agnostic Meta-Learning (MAML). It introduces two variants, independent training (M-I-MARL) and centralized training with decentralized execution (M-CTDE-MARL), which use offline data to learn initial policies that adapt rapidly to new objectives such as age-of-information (AoI) minimization and power reduction. The CTDE-based variant converges faster and more stably, adapting roughly 50% faster than the baselines, and both variants consistently outperform non-MAML offline MARL methods. The approach offers a data-efficient, safe, and scalable solution for real-time UAV coordination in evolving wireless environments.
Abstract
Multi-agent reinforcement learning (MARL) has been widely adopted in high-performance computing and complex data-driven decision-making in the wireless domain. However, conventional MARL schemes face two main obstacles in real-world scenarios. First, most MARL algorithms are online, which can be unsafe and impractical. Second, MARL algorithms are environment-specific, meaning that changes in network configuration require model retraining. This letter proposes a novel meta-offline MARL algorithm that combines conservative Q-learning (CQL) and model-agnostic meta-learning (MAML). CQL enables offline training by leveraging pre-collected datasets, while MAML ensures scalability and adaptability to dynamic network configurations and objectives. We propose two algorithm variants: independent training (M-I-MARL) and centralized training with decentralized execution (M-CTDE-MARL). Simulation results show that the proposed algorithm outperforms conventional schemes; in particular, the CTDE approach converges 50% faster than the benchmarks in dynamic scenarios. The proposed framework enhances scalability, robustness, and adaptability in wireless communication systems by optimizing UAV trajectories and scheduling policies.
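To make the combination of CQL and MAML concrete, the following is a minimal single-agent sketch, not the authors' implementation. It assumes a tabular Q-function, a discrete action space, and a first-order approximation of MAML (second-order terms are dropped); the function names `cql_loss`, `cql_grad`, and `fomaml_step`, and all hyperparameter values, are hypothetical. The CQL term penalizes large Q-values on actions absent from the dataset, and the meta-step learns an initialization that improves after one gradient step on each task's offline data.

```python
import numpy as np

def cql_loss(q, batch, alpha=1.0, gamma=0.99):
    """CQL objective on a Q-table q of shape (n_states, n_actions).

    batch = (s, a, r, s_next): offline transitions as 1-D arrays.
    """
    s, a, r, s_next = batch
    q_sa = q[s, a]
    target = r + gamma * np.max(q[s_next], axis=1)
    td = 0.5 * np.mean((q_sa - target) ** 2)
    # Conservative regularizer: logsumexp_a Q(s, a) - Q(s, a_data).
    m = np.max(q[s], axis=1)
    lse = m + np.log(np.sum(np.exp(q[s] - m[:, None]), axis=1))
    return td + alpha * np.mean(lse - q_sa)

def cql_grad(q, batch, alpha=1.0, gamma=0.99):
    """Semi-gradient of cql_loss w.r.t. q (the TD target is held constant)."""
    s, a, r, s_next = batch
    n = len(s)
    target = r + gamma * np.max(q[s_next], axis=1)
    z = q[s] - np.max(q[s], axis=1, keepdims=True)
    soft = np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)
    g = np.zeros_like(q)
    # Gradient of the logsumexp term is the softmax over actions.
    np.add.at(g, s, alpha * soft / n)
    # TD error and the -alpha * Q(s, a_data) term act on dataset actions only.
    np.add.at(g, (s, a), (q[s, a] - target - alpha) / n)
    return g

def fomaml_step(q_meta, tasks, inner_lr=0.1, outer_lr=0.1, alpha=1.0):
    """One first-order MAML meta-update over a list of tasks.

    tasks: list of (support_batch, query_batch) pairs, one per task
    (e.g., one per network configuration; this pairing is an assumption).
    """
    meta_grad = np.zeros_like(q_meta)
    for support, query in tasks:
        # Inner loop: adapt the shared initialization to this task.
        q_task = q_meta - inner_lr * cql_grad(q_meta, support, alpha)
        # Outer loop (first-order): evaluate the gradient at the adapted table.
        meta_grad += cql_grad(q_task, query, alpha)
    return q_meta - outer_lr * meta_grad / len(tasks)
```

In this sketch, a new task would be handled by starting from the meta-learned table and applying a few `cql_grad` steps on that task's offline data. Extending it to the independent or CTDE multi-agent setting (per-agent tables versus a shared centralized critic) is left out here.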
