Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization
Chengrui Qu, Christopher Yeh, Kishan Panaganti, Eric Mazumdar, Adam Wierman
TL;DR
This work targets robustness in cooperative CTDE-based MARL by addressing the misalignment risk when applying distributional robustness to multiple agents. It introduces Distributionally Robust IGM (DrIGM), which anchors robust per-agent action values to the robust joint action via a global worst-case model, guaranteeing decentralized greedy actions remain aligned with the robust optimal policy. The authors derive DrIGM-compliant variants of VDN, QMIX, and QTRAN trained with robust targets under $\rho$-contamination and total-variation uncertainty, maintaining scalability and compatibility with existing code. Empirically, DrIGM-based methods improve out-of-distribution performance on SustainGym HVAC tasks and StarCraft II under observation noise, sometimes also enhancing in-distribution stability, demonstrating practical impact for real-world cooperative systems under model uncertainty.
Abstract
Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, this recipe remains unreliable in real-world settings because of environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally Robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance. Code and data are available at https://github.com/crqu/robust-coMARL.
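
As a rough illustration of the two ingredients named above, the following minimal sketch (not the released implementation) checks the IGM property for a VDN-style additive factorization and computes a robust TD target under $\rho$-contamination. It assumes the textbook form of the contamination backup, $r + \gamma\,[(1-\rho)\max_{a'} Q_{\text{tot}}(s', a') + \rho\, v_{\min}]$, where `v_min` is a hypothetical estimate of the worst-case next-state value. All function names and numbers are illustrative assumptions, and the paper's specific definition of robust individual action values is not reproduced here.

```python
from itertools import product


def vdn_joint_q(per_agent_q, joint_action):
    """VDN-style additive factorization: Q_tot(a) = sum_i Q_i(a_i)."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def decentralized_greedy(per_agent_q):
    """Each agent independently picks the argmax of its own Q_i."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def centralized_greedy(per_agent_q):
    """Brute-force argmax of Q_tot over all joint actions."""
    action_sets = [range(len(q)) for q in per_agent_q]
    return max(product(*action_sets), key=lambda a: vdn_joint_q(per_agent_q, a))


def contamination_target(reward, gamma, rho, next_per_agent_q, v_min):
    """Robust TD target under rho-contamination (assumed textbook form).

    With probability mass rho the adversary may move the next state
    anywhere, so that mass is valued at v_min (hypothetical estimate).
    """
    greedy = decentralized_greedy(next_per_agent_q)
    next_v = vdn_joint_q(next_per_agent_q, greedy)
    return reward + gamma * ((1.0 - rho) * next_v + rho * v_min)


# Toy per-agent values for two agents (illustrative numbers only).
per_agent_q = [[0.2, 1.0, -0.5], [0.7, 0.1]]

# IGM: decentralized greedy actions match the joint argmax.
igm_holds = decentralized_greedy(per_agent_q) == centralized_greedy(per_agent_q)

# Robust target with rho = 0.1 versus the nominal target (rho = 0).
robust_target = contamination_target(1.0, 0.9, 0.1, per_agent_q, v_min=0.0)
nominal_target = contamination_target(1.0, 0.9, 0.0, per_agent_q, v_min=0.0)
```

Setting `rho = 0` recovers the usual non-robust target. Increasing `rho` shifts weight from the greedy next-state value toward `v_min`, which lowers the target whenever `v_min` is below that greedy value, as it is in this toy example.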
