MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

Ziyan Wang; Yali Du; Yudi Zhang; Meng Fang; Biwei Huang

TL;DR

This paper proposes Multi-Agent Causal Credit Assignment (MACCA), a framework for credit assignment in offline MARL, and proves that, in the offline-dataset setting, the underlying causal structure and the function generating each agent's individual rewards are identifiable.

Abstract

Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings is challenging because interaction with the environment is prohibited. In this paper, we propose a new framework, Multi-Agent Causal Credit Assignment (MACCA), to address credit assignment in the offline MARL setting. MACCA characterizes the generative process as a Dynamic Bayesian Network that captures the relationships between environmental variables, states, actions, and rewards. By estimating this model on offline data, MACCA learns each agent's contribution by analyzing the causal relationships of the agents' individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to integrate seamlessly with various offline MARL methods. Theoretically, we prove that, in the offline-dataset setting, the underlying causal structure and the function generating the agents' individual rewards are identifiable, which lays the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.
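
The credit-assignment idea in the abstract can be illustrated with a toy sketch. It is not the paper's implementation: the linear predictor, the hand-set binary masks, the sum aggregation of individual rewards, the squared-error loss, and all function and variable names below are assumptions made for illustration only.

```python
import numpy as np

def predict_individual_rewards(state, action, mask_s, mask_a, w_s, w_a):
    """Toy masked reward predictor (assumed linear form, not from the paper).

    state:  (d_s,) joint state        action: (d_a,) joint action
    mask_s: (n, d_s) binary masks, 1 = state dim is a parent of agent i's reward
    mask_a: (n, d_a) binary masks, 1 = action dim is a parent of agent i's reward
    w_s, w_a: weights with the same shapes as the masks
    Returns an (n,) vector of individual reward estimates.
    """
    # Masked-out dimensions get zero weight, so they cannot influence the estimate.
    return (mask_s * w_s) @ state + (mask_a * w_a) @ action

def team_reward_loss(r_hat, team_reward):
    """Squared error between the summed estimates and the observed team reward."""
    return (r_hat.sum() - team_reward) ** 2

# Hypothetical example: 2 agents, 3 joint-state dims, 2 joint-action dims.
mask_s = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
mask_a = np.array([[1.0, 0.0], [0.0, 1.0]])
w_s = np.ones_like(mask_s)
w_a = np.ones_like(mask_a)

state = np.array([1.0, 2.0, 3.0])
action = np.array([0.5, -1.0])

r_hat = predict_individual_rewards(state, action, mask_s, mask_a, w_s, w_a)
loss = team_reward_loss(r_hat, team_reward=5.5)
```

Only the team reward supervises the predictor here; the masks determine which state and action dimensions each agent's estimate may depend on, which is what makes the resulting assignment inspectable.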

Paper Structure (27 sections, 2 theorems, 16 equations, 6 figures, 9 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Offline MARL with Causal Credit Assignment
Underlying Generative Process in MARL
Causal Model Learning
Policy Learning with Assigned Individual Rewards
Individual Rewards Assignment
Offline Policy Learning
Experiments
General Implementation
Main Results
Multi-agent Particle Environment (MPE)
StarCraft Micromanagement Challenges (SMAC)
Ablation Studies
...and 12 more sections

Key Result

Proposition 1

Suppose the joint state $\bm{s}_t$, joint action $\bm{a}_t$, and team reward $R_t$ are observable while the individual rewards $r^i_t$ of each agent are unobserved, and that they are generated by the Dec-POMDP described in Eq. main_function. Then under the Markov condition and faithfulness assumption (refer to Appendix ap

Figures (6)

Figure 1: Graphical representation of the causal structure within the MACCA framework. The nodes and edges represent the causal relationships among the environmental variables, i.e., the individual dimensions of these variables for each agent, in the team-reward multi-agent MDP setting. These include the dimensions of the state $s^i_{\cdots,t}$ and action $a^i_{\cdots,t}$, the individual reward $r^i_t$ for agent $i$, and the team reward $R_t$. The individual reward $r^i_t$ (shown with blue filling) is unobservable, and the aggregation of $r^i_t$ equals ${R}_t$.
Figure 2: Illustration of the MACCA method. The offline data generation process begins on the left side, where data is recorded from the environment. MACCA then constructs a causal model consisting of a DBN (grey) and an individual reward predictor (blue). The DBN is used to sample masks for each agent, denoted as $c_t^{i,\cdot \rightarrow \cdot}$ and highlighted in green. The individual reward predictor takes the joint state, the joint action, and these masks as input to generate the individual reward estimate $\hat{r}^i_t$. During the policy learning phase, each agent uses its observation and individual reward estimate as inputs, which are passed through its policy network to generate the next-state actions.
Figure 3: Visualization of the causal structure, showing the probability of causal edges on a scale from blue (high probability) to yellow (low probability). (a) The causal structure $\hat{c}_t^{i,s\rightarrow r}$ between the state of all agents (18 dimensions for each agent, 54 dimensions for the joint state) and the individual reward (1 dimension for each agent). (b) The causal structure $\hat{c}_t^{i, a \rightarrow r}$ between the action of each agent (2 dimensions for each agent, 6 dimensions for the joint action) and the individual reward (1 dimension for each agent).
Figure : Average normalized scores for ground truth individual reward comparison in MPE-CN
Figure : Average win rate in SMAC 5m_vs_6m map, expert dataset.
...and 1 more figure
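
The relationships described in the captions of Figures 1 and 2 can be written compactly as follows. This is a hedged restatement rather than the paper's own equations: the number of agents $N$, the predictor symbol $f$, the element-wise product $\odot$, and the reading of "aggregation" as a sum are assumptions.

$$
R_t = \sum_{i=1}^{N} r^i_t, \qquad \hat{r}^i_t = f\big(c_t^{i,s\rightarrow r} \odot \bm{s}_t,\; c_t^{i,a\rightarrow r} \odot \bm{a}_t\big)
$$

Under this reading, the masks $c_t^{i,s\rightarrow r}$ and $c_t^{i,a\rightarrow r}$ select which dimensions of the joint state and joint action are treated as causal parents of agent $i$'s individual reward.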

Theorems & Definitions (2)

Proposition 1: Identifiability of Causal Structure and Individual Reward Function
Proposition 1: Individual Reward Function Identifiability
