Conservative and Risk-Aware Offline Multi-Agent Reinforcement Learning

Eslam Eldeeb, Houssem Sifaou, Osvaldo Simeone, Mohammad Shehab, Hirley Alves

TL;DR

An offline MARL scheme is proposed that integrates distributional RL and conservative Q-learning to address the environment's inherent aleatoric uncertainty and the epistemic uncertainty arising from the use of offline data.

Abstract

Reinforcement learning (RL) has been widely adopted for controlling and optimizing complex engineering systems such as next-generation wireless networks. An important challenge in adopting RL is the need for direct access to the physical environment. This limitation is particularly severe in multi-agent systems, for which conventional multi-agent reinforcement learning (MARL) requires a large number of coordinated online interactions with the environment during training. When only offline data is available, a direct application of online MARL schemes would generally fail due to the epistemic uncertainty entailed by the lack of exploration during training. In this work, we propose an offline MARL scheme that integrates distributional RL and conservative Q-learning to address the environment's inherent aleatoric uncertainty and the epistemic uncertainty arising from the use of offline data. We explore both independent and joint learning strategies. The proposed MARL scheme, referred to as multi-agent conservative quantile regression, addresses general risk-sensitive design criteria and is applied to the trajectory planning problem in drone networks, showcasing its advantages.

Paper Structure (24 sections, 33 equations, 8 figures, 4 tables, 4 algorithms)

Figures (8)

  • Figure 1: Consider access to data collected offline following some fixed and unknown policies $\pi_{\beta} = \{\pi_{\beta}^i\}^I_{i=1}$ in an environment consisting of $I$ agents. Based on this dataset, the goal is to optimize policies $\pi = \{\pi^i\}^I_{i=1}$ for the agents while ensuring robustness to the uncertainty arising from the stochastic environment, from the limited data, and from the lack of interactions with the environment.
  • Figure 2: Illustration of the conditional value-at-risk (CVaR). The quantile function $F_Z^{-1}(\xi)$ is plotted as a function of the risk tolerance level $\xi$. The shaded area representing the lower tail of the distribution depicts the $\xi$-level CVaR.
  • Figure 3: Multiple UAVs serve limited-power sensors to minimize power expenditure while also minimizing the age of information for data retrieval from the sensors. The environment is characterized by a risk region for navigation of the UAVs in the middle of the grid world (colored area).
  • Figure 4: Average test return as a function of the number of training epochs using $\mathrm{P_{risk}}=100$ and $16 \%$ offline dataset for a system of $2$ UAVs serving $10$ sensors. The return is averaged over $100$ test episodes at the end of each training epoch and shown upon division by $1000$.
  • Figure 5: Sum-AoI as a function of the sum-power using $\mathrm{P_{risk}}={\lambda}/{4}$ and $16 \%$ offline dataset for a system of $2$ UAVs serving $10$ sensors.
  • ...and 3 more figures
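
The CVaR illustrated in Figure 2 can be made concrete with a small numerical sketch. Under the common lower-tail definition, $\mathrm{CVaR}_{\xi}(Z) = \frac{1}{\xi}\int_0^{\xi} F_Z^{-1}(u)\,du$, so the shaded area in the figure, normalized by $\xi$, is the average return over the worst $\xi$-fraction of outcomes. The function below is a hypothetical helper, not the paper's estimator: it assumes the quantile function is approximated by $N$ equally weighted quantile estimates, each treated as constant over an interval of probability mass $1/N$, as is typical in quantile-regression-based distributional RL.

```python
def cvar_from_quantiles(quantiles, xi):
    """Approximate the xi-level lower-tail CVaR from N quantile estimates.

    Assumption (not from the paper): the j-th smallest estimate stands for
    the quantile function F_Z^{-1}(u) on the interval [j/N, (j+1)/N).
    """
    if not 0.0 < xi <= 1.0:
        raise ValueError("xi must lie in (0, 1]")
    q = sorted(float(v) for v in quantiles)
    n = len(q)
    if n == 0:
        raise ValueError("at least one quantile estimate is required")

    total = 0.0
    for j, value in enumerate(q):
        lo, hi = j / n, (j + 1) / n
        # Length of the part of [lo, hi) that lies inside the tail [0, xi].
        overlap = max(0.0, min(hi, xi) - lo)
        total += overlap * value
    # Normalize the tail integral by xi to obtain an average over the tail.
    return total / xi


# Four hypothetical quantile estimates of the return distribution.
returns = [1.0, 2.0, 3.0, 4.0]
risk_neutral = cvar_from_quantiles(returns, 1.0)   # mean of all estimates
risk_averse = cvar_from_quantiles(returns, 0.5)    # mean of the lower half
```

With $\xi = 1$ the quantity reduces to the ordinary expected return, while smaller values of $\xi$ weight only the lowest returns, which is what makes the criterion risk-sensitive.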