On Minimizing Adversarial Counterfactual Error in Adversarial RL
Roman Belaire, Arunesh Sinha, Pradeep Varakantham
TL;DR
This work addresses the vulnerability of deep RL policies to adversarial observation perturbations by explicitly modeling partial observability through beliefs about the true state. It introduces Adversarial Counterfactual Error (ACoE) and its scalable surrogate Cumulative-ACoE (C-ACoE), and develops practical surrogates A2B and A3B Beliefs to enable model-free optimization via PPO and DQN. The proposed approach balances maximizing nominal value with minimizing adversarial counterfactual error, and achieves state-of-the-art robustness against greedy and long-horizon adversaries across MuJoCo, Atari, and Highway benchmarks. The results suggest that belief-based robustness, together with efficient surrogates, offers a promising direction for making DRL policies safer in adversarial environments.
Abstract
Deep Reinforcement Learning (DRL) policies are highly susceptible to adversarial noise in observations, which poses significant risks in safety-critical scenarios. The challenge inherent to adversarial perturbations is that, by altering the information the agent observes, they render the state only partially observable. Existing approaches address this by either enforcing consistent actions across nearby states or maximizing the worst-case value within adversarially perturbed observations. However, the former suffers from performance degradation when attacks succeed, while the latter tends to be overly conservative, leading to suboptimal performance in benign settings. We hypothesize that these limitations stem from a failure to account for partial observability directly. To this end, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), which is defined on beliefs about the true state and balances value optimization with robustness. To make ACoE scalable in model-free settings, we propose the theoretically grounded surrogate objective Cumulative-ACoE (C-ACoE). Our empirical evaluations on standard benchmarks (MuJoCo, Atari, and Highway) demonstrate that our method significantly outperforms current state-of-the-art approaches for addressing adversarial RL challenges, offering a promising direction for improving robustness in DRL under adversarial conditions. Our code is available at https://github.com/romanbelaire/acoe-robust-rl.
