Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization

Chengrui Qu, Christopher Yeh, Kishan Panaganti, Eric Mazumdar, Adam Wierman

TL;DR

This work targets robustness in cooperative CTDE-based MARL by addressing the misalignment risk when applying distributional robustness to multiple agents. It introduces Distributionally Robust IGM (DrIGM), which anchors robust per-agent action values to the robust joint action via a global worst-case model, guaranteeing decentralized greedy actions remain aligned with the robust optimal policy. The authors derive DrIGM-compliant variants of VDN, QMIX, and QTRAN trained with robust targets under $\rho$-contamination and total-variation uncertainty, maintaining scalability and compatibility with existing code. Empirically, DrIGM-based methods improve out-of-distribution performance on SustainGym HVAC tasks and StarCraft II under observation noise, sometimes also enhancing in-distribution stability, demonstrating practical impact for real-world cooperative systems under model uncertainty.
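To make the phrase "robust targets under $\rho$-contamination" concrete, the sketch below computes a one-sample robust TD target in the textbook form used for $\rho$-contamination uncertainty sets: with probability $1-\rho$ the nominal next-state value is used, and with probability $\rho$ an adversary moves the system to a worst-case state. This is an illustrative assumption, not necessarily the exact target used in the paper; the function name and the arguments `v_next` and `v_worst` are hypothetical, and how the worst-case value is estimated in practice is left open.

```python
def rho_contamination_target(reward, gamma, rho, v_next, v_worst, done=False):
    """One-sample robust TD target under a rho-contamination uncertainty set.

    Assumed textbook form (not taken from the paper):
        y = r + gamma * ((1 - rho) * V(next) + rho * V_worst)

    v_next  : value of the sampled next state (in a factorized MARL method this
              would be Q_tot evaluated at the greedy joint action).
    v_worst : estimate of the minimum value over states (hypothetical input).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if done:
        # No bootstrapping on terminal transitions.
        return reward
    return reward + gamma * ((1.0 - rho) * v_next + rho * v_worst)


# rho = 0 recovers the usual non-robust TD target.
nominal = rho_contamination_target(1.0, 0.9, 0.0, v_next=10.0, v_worst=2.0)
# A positive rho shifts weight toward the worst-case value.
robust = rho_contamination_target(1.0, 0.9, 0.2, v_next=10.0, v_worst=2.0)
```

Because the worst-case value never exceeds the sampled next-state value in this example, the robust target is lower than the nominal one, which is the pessimism that robust training relies on.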

Abstract

Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, this recipe remains unreliable in real-world settings due to environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally Robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance. Code and data are available at https://github.com/crqu/robust-coMARL.
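The IGM property mentioned above can be checked by brute force on a toy problem. The sketch below assumes a VDN-style additive factorization, where the joint value is the sum of per-agent utilities, and verifies that each agent's local greedy action matches the centralized greedy joint action. The utility values and all function names are hypothetical and chosen only for illustration; this is the non-robust IGM check, not the paper's DrIGM construction.

```python
from itertools import product


def decentralized_greedy(per_agent_q):
    """Each agent independently picks the argmax of its own utilities."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def vdn_q_tot(per_agent_q, joint_action):
    """VDN-style joint value: the sum of the chosen per-agent utilities."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def joint_greedy(per_agent_q, q_tot):
    """Brute-force argmax of q_tot over every joint action."""
    action_spaces = [range(len(q)) for q in per_agent_q]
    return max(product(*action_spaces), key=lambda a: q_tot(per_agent_q, a))


# Toy utilities for three agents with 3, 2, and 3 actions (made-up numbers).
per_agent_q = [[0.1, 0.7, 0.3], [1.2, -0.4], [0.0, 0.5, 0.9]]

local = decentralized_greedy(per_agent_q)
central = joint_greedy(per_agent_q, vdn_q_tot)
```

The brute-force search enumerates every joint action, so its cost grows exponentially with the number of agents, whereas the decentralized argmax is linear in it. That gap is why factorization methods aim to satisfy IGM in the first place.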

Paper Structure

This paper contains 66 sections, 3 theorems, 54 equations, 6 figures, 3 tables, 7 algorithms.

Key Result

Theorem 1

Given a global uncertainty set $\mathcal{P}$ defined in eq:decpomdp-rect, suppose for all $P\in\mathcal{P}$, there exist $[Q^P_i]_{i \in [N]}$ satisfying def:igm for $Q_{\mathrm{tot}}^{P}$ under joint history $\mathbf{h} = (h_1, \dotsc, h_N) \in \mathcal{H}$. Let denote the global worst-case model and the robust joint greedy action, respectively. For each agent $i \in [N]$, define the robust indi

Figures (6)

  • Figure 1: Overview of our robust value factorization algorithms. Because the robust individual action-value functions satisfy DrIGM (Definition 2), greedy actions can be computed efficiently in a decentralized manner while the function parameters are trained with a robust TD loss based on global reward.
  • Figure 2: Normalized performance (averaged over 5 independent training runs, with error bars showing standard error) across different environment configurations for our robust MARL algorithms and other baselines. Each panel corresponds to one value factorization method. Robustness gain is the difference in reward (shaded area) between Robust (ours) and Non-robust, which shows the out-of-distribution performance improvement from the robust training.
  • Figure 3: Performance of our robust MARL algorithms and their non-robust baselines in SMAC (3s_vs_5z map). Each algorithm is evaluated every 10,000 environment steps, with each evaluation averaged over 32 episodes. Shaded regions denote the standard error across 5 random seeds. For small $\rho$, the robust algorithms significantly outperform their non-robust counterparts.
  • Figure 4: Improvement in final test win rate of our robust MARL algorithms over their non-robust baselines in SMAC (3s_vs_5z map) for different values of $\rho$. Error bars denote the standard error across 5 random seeds.
  • Figure 5: Two MDPs, one under transition kernel $P_1$ and the other under $P_2$. The two differ in their transition probabilities to $s_2$ and $s_3$.
  • ...and 1 more figure

Theorems & Definitions (12)

  • Definition 1: IGM
  • Definition 2: DrIGM
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Example 1: Naïve single-agent robust action values cannot guarantee DrIGM (Definition 2)
  • Example 2: DrIGM (Definition 2) can address cases where IGM (Definition 1) fails
  • proof
  • Remark 1
  • Remark 2
  • ...and 2 more