Constrained Black-Box Attacks Against Cooperative Multi-Agent Reinforcement Learning
Amine Andam, Jamal Bentahar, Mustapha Hedabou
TL;DR
The paper investigates vulnerabilities of deployed cooperative multi-agent reinforcement learning (c-MARL) systems in a constrained black-box setting, where an adversary can only collect and perturb the agents' observations and has no access to policies or weights. It introduces misalignment-based attacks, notably the Align attack, which trains a predictor $f_{\theta}$ to estimate each agent's observation from the other agents' observations and then applies PGD-style perturbations that maximize a misalignment loss $J(\mathbf{o}+\delta; \theta)$ subject to $\|\delta\|_{\infty} \le \epsilon$. It also proposes a Targeted Align variant and a Free misalignment attack based on partial Hadamard matrices. Experiments across 22 environments in three MARL benchmarks (SMAC, RWARE, LBF) show that these attacks significantly degrade performance even with limited data (as few as 1,000 samples) and under partial observability, with targeted attacks further reducing coordination. The work highlights the plausibility and practicality of such threats and motivates developing defenses against observation-level perturbations in deployed c-MARL systems.
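To make the PGD-style step concrete, here is a minimal NumPy sketch of maximizing a misalignment loss under an $\ell_\infty$ budget. It is an illustration under simplifying assumptions, not the authors' implementation: the predictor is assumed to be a fixed linear map `W` (a hypothetical stand-in for the trained $f_{\theta}$) whose diagonal blocks are zero, so each agent's observation is predicted only from the other agents' observations, and the loss is taken to be the squared prediction residual $J(\mathbf{o}+\delta) = \|(W - I)(\mathbf{o}+\delta)\|^2$, which has a closed-form gradient. The names `misalignment_loss` and `pgd_align`, the step size `alpha`, and all sizes are illustrative choices.

```python
import numpy as np

def misalignment_loss(x, A):
    """J(x) = ||A x||^2, with A = W - I the residual map of an assumed linear predictor W."""
    r = A @ x
    return float(r @ r)

def pgd_align(o, A, eps=0.1, alpha=0.02, steps=20):
    """Signed-gradient ascent on J(o + delta), projected onto the box ||delta||_inf <= eps."""
    delta = np.zeros_like(o)
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ (o + delta))      # analytic gradient of ||A x||^2
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)  # ascent step, then projection
    return delta

# Toy setup (all sizes and values are hypothetical).
rng = np.random.default_rng(0)
n_agents, obs_dim = 3, 4
D = n_agents * obs_dim
eps = 0.1

W = rng.normal(scale=0.3, size=(D, D))
for i in range(n_agents):
    s = slice(i * obs_dim, (i + 1) * obs_dim)
    W[s, s] = 0.0                                  # agent i is predicted from the others only
A = W - np.eye(D)

o = rng.normal(size=D)                             # flattened joint observation
delta = pgd_align(o, A, eps=eps)

clean_loss = misalignment_loss(o, A)
attacked_loss = misalignment_loss(o + delta, A)
print(f"loss before: {clean_loss:.3f}, after: {attacked_loss:.3f}")
```

With a neural predictor, the analytic gradient would be replaced by automatic differentiation; the sign step and the clipping to $[-\epsilon, \epsilon]$ would stay the same.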
Abstract
Collaborative multi-agent reinforcement learning has rapidly evolved, offering state-of-the-art algorithms for real-world applications, including sensitive domains. However, a key challenge to its widespread adoption is the lack of a thorough investigation into its vulnerabilities to adversarial attacks. Existing work predominantly focuses on training-time attacks or unrealistic scenarios, such as access to policy weights or the ability to train surrogate policies. In this paper, we investigate new vulnerabilities under more challenging and constrained conditions, assuming an adversary can only collect and perturb the observations of deployed agents. We also consider scenarios where the adversary has no access at all (no observations, actions, or weights). Our main approach is to generate perturbations that intentionally misalign how victim agents see their environment. Our approach is empirically validated on three benchmarks and 22 environments, demonstrating its effectiveness across diverse algorithms and environments. Furthermore, we show that our algorithm is sample-efficient, requiring only 1,000 samples compared to the millions needed by previous methods.
