Robust Multi-Agent Reinforcement Learning by Mutual Information Regularization

Simin Li, Ruixiao Xu, Jingqiao Xiu, Yuwei Zheng, Pu Feng, Yaodong Yang, Xianglong Liu

TL;DR

This work addresses robustness in multi-agent reinforcement learning (MARL) when allies may take unpredictable or worst-case actions. It frames robust MARL as an inference problem and introduces MIR3, a mutual-information regularization term that, when added to a standard MARL objective during routine training, provably maximizes a lower bound on robustness against unseen worst-case adversaries, with robustness optimized implicitly via off-policy evaluation. The regularizer acts as an information bottleneck and aligns policies with a robust action prior, so agents avoid overreacting to uncertain allies while preserving cooperative performance. Empirically, MIR3 improves robustness and training efficiency on StarCraft II and robot swarm control, and in real-world robot swarm deployment it outperforms the best baseline by 14.29%.

Abstract

In multi-agent reinforcement learning (MARL), ensuring robustness against unpredictable or worst-case actions by allies is crucial for real-world deployment. Existing robust MARL methods either approximate or enumerate all possible threat scenarios against worst-case adversaries, leading to computational intensity and reduced robustness. In contrast, human learning efficiently acquires robust behaviors in daily life without preparing for every possible threat. Inspired by this, we frame robust MARL as an inference problem, with worst-case robustness implicitly optimized under all threat scenarios via off-policy evaluation. Within this framework, we demonstrate that Mutual Information Regularization as Robust Regularization (MIR3) during routine training is guaranteed to maximize a lower bound on robustness, without the need for adversaries. Further insights show that MIR3 acts as an information bottleneck, preventing agents from over-reacting to others and aligning policies with robust action priors. In the presence of worst-case adversaries, our MIR3 significantly surpasses baseline methods in robustness and training efficiency while maintaining cooperative performance in StarCraft II and robot swarm control. When deploying the robot swarm control algorithm in the real world, our method also outperforms the best baseline by 14.29%.

Paper Structure

This paper contains 22 sections, 1 theorem, 22 equations, 13 figures, 7 tables, and 1 algorithm.

Key Result

Proposition 3.1

$J(\pi) \geq \sum_{t=1}^T \mathbb E_{\tau^0 \sim \hat{p}(\tau^0)}[r_t - \lambda I(\mathbf{h}_t; \mathbf{a}_t)]$, where $\lambda$ is a hyperparameter. (Footnote: in principle, $\lambda$ is not needed, since it can be absorbed into the reward function. It is made explicit here to represent the tradeoff between reward ...)
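To make the shape of this bound concrete, the following is a minimal sketch of the per-step quantity $r_t - \lambda I(\mathbf{h}_t; \mathbf{a}_t)$ for small discrete histories and actions. It is an illustration only: the plug-in (count-based) mutual information estimator and the function names `mutual_information` and `regularized_return` are assumptions introduced here, not the estimator or code used in the paper, which this summary does not describe.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(H; A) in nats from (history, action) samples.

    Hypothetical helper for illustration; assumes discrete, hashable
    histories and actions.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_h = Counter(h for h, _ in pairs)
    count_a = Counter(a for _, a in pairs)
    mi = 0.0
    for (h, a), c in joint.items():
        # p(h, a) * log( p(h, a) / (p(h) * p(a)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_h[h] * count_a[a]))
    return mi


def regularized_return(rewards, mi_per_step, lam):
    """Sum over t of r_t - lam * I(h_t; a_t), the right-hand side of the bound."""
    return sum(r - lam * mi for r, mi in zip(rewards, mi_per_step))


# A policy whose action copies the history carries log(2) nats of information.
copy_pairs = [(0, 0), (1, 1)] * 50
# A policy whose action is independent of the history carries none.
indep_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

mi_copy = mutual_information(copy_pairs)
mi_indep = mutual_information(indep_pairs)

# With equal rewards, the history-independent policy scores higher
# under the regularized objective because it pays no information penalty.
print(regularized_return([1.0, 1.0], [mi_copy, mi_copy], lam=0.5))
print(regularized_return([1.0, 1.0], [mi_indep, mi_indep], lam=0.5))
```

In this toy setting, a larger $\lambda$ penalizes policies whose actions depend strongly on the observed history, which is the tradeoff against reward that the hyperparameter is meant to expose.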

Figures (13)

  • Figure 1: Our policies are learned under routine scenarios but are provably robust against unseen worst-case adversaries through robust regularization, contrasting with existing approaches that require exposure to all possible threat scenarios.
  • Figure 2: MIR3 as information bottleneck, eliminating spurious correlations in histories and mitigating overreactions to agents with action uncertainties, forming robust agent-wise interactions.
  • Figure 3: MIR3 as robust action prior. The objective biases the policy toward effective actions in the environment and fosters exploration around this action prior to handle task variations and uncertainties.
  • Figure 4: Cooperative and robust performance on six SMAC tasks, evaluated on MADDPG and QMIX backbones. Despite never seeing adversaries during training, MIR3 outperforms baselines that explicitly consider threat scenarios. Results are reported over 5 seeds for cooperative scenarios and $5 \times N$ seeds for adversarial scenarios, with 95% confidence intervals.
  • Figure 5: Agent behaviors under attack in task 4m vs 3m, with the adversary denoted by a red square. Under the MADDPG backbone, baselines are either swayed by adversaries or lack cooperation on focused fire. Under the QMIX backbone, baselines are frequently swayed without attack. In contrast, MIR3 is not swayed by the adversary and preserves cooperation on focused fire.
  • ...and 8 more figures
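
The information-bottleneck and action-prior readings in Figures 2 and 3 can be connected through a standard identity for mutual information. The sketch below assumes the mutual information is taken under the joint distribution of histories and actions induced by the policy $\pi$; the marginal $p(\mathbf{a}_t)$ is notation introduced here for illustration and may differ from the paper's own.

```latex
\begin{aligned}
I(\mathbf{h}_t; \mathbf{a}_t)
  &= \mathbb{E}_{\mathbf{h}_t}\!\left[
       D_{\mathrm{KL}}\!\big(\pi(\mathbf{a}_t \mid \mathbf{h}_t)\,\big\|\,p(\mathbf{a}_t)\big)
     \right],
\qquad
p(\mathbf{a}_t) = \mathbb{E}_{\mathbf{h}_t}\!\left[\pi(\mathbf{a}_t \mid \mathbf{h}_t)\right].
\end{aligned}
```

Read this way, penalizing $I(\mathbf{h}_t; \mathbf{a}_t)$ both limits how much information about the history flows into the action (the bottleneck view) and pulls the history-conditioned policy toward its own marginal over actions (the action-prior view).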

Theorems & Definitions (1)

  • Proposition 3.1