Belief States for Cooperative Multi-Agent Reinforcement Learning under Partial Observability

Paul J. Pritz, Kin K. Leung

TL;DR

This work tackles cooperative multi-agent reinforcement learning under partial observability by learning per-agent belief states through self-supervised pre-training with full state data. The Belief-I2Q framework integrates these beliefs into a decentralized state-based Q-learning approach, extending I2Q to handle partial observability without centralized training data. Experiments across four partially observable grid-world tasks show improved convergence and final performance in several domains, highlighting the value of separating representation learning (beliefs) from policy learning in a fully decentralized training and decentralized execution (DTDE) setting. The approach potentially enables scalable, decentralized coordination in complex multi-agent environments by leveraging probabilistic state beliefs and uncertainty.

Abstract

Reinforcement learning in partially observable environments is typically challenging, as it requires agents to learn an estimate of the underlying system state. These challenges are exacerbated in multi-agent settings, where agents learn simultaneously and influence the underlying state as well as each other's observations. We propose the use of learned beliefs on the underlying state of the system to overcome these challenges and enable reinforcement learning with fully decentralized training and execution. Our approach leverages state information to pre-train a probabilistic belief model in a self-supervised fashion. The resulting belief states, which capture both inferred state information and uncertainty over this information, are then used in a state-based reinforcement learning algorithm to create an end-to-end model for cooperative multi-agent reinforcement learning under partial observability. By separating the belief and reinforcement learning tasks, we are able to significantly simplify the policy and value function learning tasks and improve both the convergence speed and the final performance. We evaluate our proposed method on diverse partially observable multi-agent tasks designed to exhibit different variants of partial observability.
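
The Figure 1 caption refers to a loss $\mathcal{L}^{CVAE}$ that this summary does not state. As a hedged sketch only, a standard conditional VAE objective (the negative evidence lower bound) would take the form below. The symbols are assumptions, not taken from the paper: $s$ is the true state available during pre-training, $h^i$ is agent $i$'s encoded local history, $z$ is the latent variable, $q_\phi$ is the encoder, and $p_\theta$ is the decoder. The prior $p(z)$ is assumed to be a standard normal; a history-conditioned prior $p(z \mid h^i)$ is a common alternative.

$$
\mathcal{L}^{CVAE} = -\mathbb{E}_{q_\phi(z \mid s, h^i)}\left[\log p_\theta(s \mid z, h^i)\right] + D_{KL}\left(q_\phi(z \mid s, h^i) \,\|\, p(z)\right)
$$

Under this assumed form, only the encoder $q_\phi$ needs the true state $s$, so it is required for pre-training alone. After pre-training, a belief can be obtained by sampling $z \sim p(z)$ and decoding it conditioned on $h^i$, which would match the caption's note that the encoder is discarded for the RL part.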

Paper Structure

This paper contains 22 sections, 11 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: CVAE architecture to learn beliefs over states from local histories. The encoder part of the model is only used during pre-training and discarded for the RL part. All model components are trained using the $\mathcal{L}^{CVAE}$ loss. Both encoder and decoder losses are back-propagated to the history encoder. Each agent maintains its own model.
  • Figure 2: Oracle and Gathering environments.
  • Figure 3: Escape Room and HoneyComb environments.
  • Figure 4: Evaluation results of our approach Belief-I2Q against recurrent baselines of I2Q and hysteretic Q-learning. The plots show returns per episode, smoothed over 100 episodes. The results are averaged over three random seeds. The shaded areas show the standard deviation of the results across random seeds.
  • Figure 5: Visualization of the belief state before (LHS) and after (RHS) querying the oracle. The belief model's outputs (mean and standard deviation) are averaged over 100 episodes per belief state visualization. The plotted standard deviation is, therefore, the average of the standard deviations predicted by the belief model across samples, not the standard deviation of the means across samples. The dashed lines in the contours represent one standard deviation. For this visualization, we only query belief states from one agent.
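
The captions of Figures 4 and 5 describe how the plotted statistics are aggregated. The NumPy sketch below illustrates one way to compute them; it is not the authors' plotting code. The function names, the trailing (rather than centered) moving average, smoothing each seed before aggregating, and the use of the population standard deviation are all assumptions made for illustration.

```python
import numpy as np


def smooth_returns(returns, window=100):
    """Trailing moving average over `window` episodes.

    Assumption: early episodes average over however many episodes exist so far.
    """
    returns = np.asarray(returns, dtype=float)
    csum = np.cumsum(np.insert(returns, 0, 0.0))
    idx = np.arange(1, len(returns) + 1)
    start = np.maximum(idx - window, 0)
    return (csum[idx] - csum[start]) / (idx - start)


def aggregate_seeds(returns_per_seed, window=100):
    """Figure 4 style: smooth each seed's curve, then mean and std across seeds."""
    smoothed = np.stack([smooth_returns(r, window) for r in returns_per_seed])
    return smoothed.mean(axis=0), smoothed.std(axis=0)


def summarize_beliefs(means, stds):
    """Figure 5 style summary of belief outputs of shape (episodes, state_dim).

    Returns three arrays:
      - the mean prediction averaged over episodes,
      - the average of the predicted standard deviations (what the caption plots),
      - the standard deviation of the predicted means (what the caption says
        is NOT plotted), included only for contrast.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    return means.mean(axis=0), stds.mean(axis=0), means.std(axis=0)


# Synthetic illustration (made-up numbers, not results from the paper):
# two episodes whose predicted means disagree while each prediction is confident.
avg_mean, avg_std, std_of_means = summarize_beliefs(
    means=[[0.0], [2.0]], stds=[[0.5], [0.5]]
)
```

In the synthetic example, the average predicted standard deviation (0.5) is smaller than the spread of the means (1.0). The two quantities answer different questions: the first reflects the uncertainty the belief model itself reports, while the second reflects how much its point estimates vary between episodes.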