On Minimizing Adversarial Counterfactual Error in Adversarial RL

Roman Belaire; Arunesh Sinha; Pradeep Varakantham

TL;DR

This work addresses the vulnerability of deep RL policies to adversarial observation perturbations by explicitly modeling partial observability through beliefs about the true state. It introduces Adversarial Counterfactual Error (ACoE) and its scalable surrogate Cumulative-ACoE (C-ACoE), and develops practical surrogates A2B and A3B Beliefs to enable model-free optimization via PPO and DQN. The proposed approach balances maximizing nominal value with minimizing adversarial counterfactual error, and achieves state-of-the-art robustness against greedy and long-horizon adversaries across MuJoCo, Atari, and Highway benchmarks. The results suggest that belief-based robustness, together with efficient surrogates, offers a promising direction for making DRL policies safer in adversarial environments.

Abstract

Deep Reinforcement Learning (DRL) policies are highly susceptible to adversarial noise in observations, which poses significant risks in safety-critical scenarios. The challenge inherent to adversarial perturbations is that by altering the information observed by the agent, the state becomes only partially observable. Existing approaches address this by either enforcing consistent actions across nearby states or maximizing the worst-case value within adversarially perturbed observations. However, the former suffers from performance degradation when attacks succeed, while the latter tends to be overly conservative, leading to suboptimal performance in benign settings. We hypothesize that these limitations stem from their failing to account for partial observability directly. To this end, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), defined on the beliefs about the true state and balancing value optimization with robustness. To make ACoE scalable in model-free settings, we propose the theoretically-grounded surrogate objective Cumulative-ACoE (C-ACoE). Our empirical evaluations on standard benchmarks (MuJoCo, Atari, and Highway) demonstrate that our method significantly outperforms current state-of-the-art approaches for addressing adversarial RL challenges, offering a promising direction for improving robustness in DRL under adversarial conditions. Our code is available at https://github.com/romanbelaire/acoe-robust-rl.
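The contrast the abstract draws between trusting the observation, planning for the worst case, and reasoning over a belief about the true state can be illustrated with a toy example. The sketch below is not the paper's ACoE definition. The five-state world, the values in `V`, the perturbation radius `eps`, and the uniform belief are all assumptions chosen for illustration.

```python
import numpy as np

# Toy 1-D world: five states with made-up values under some fixed policy.
V = np.array([0.0, 1.0, 5.0, 4.0, -10.0])

def candidate_states(s_obs, eps, n_states):
    """True states consistent with an observation perturbed by at most eps."""
    return [s for s in range(n_states) if abs(s - s_obs) <= eps]

def uniform_belief(cands):
    """Placeholder belief: equal probability on every candidate true state."""
    return {s: 1.0 / len(cands) for s in cands}

s_obs, eps = 3, 1
cands = candidate_states(s_obs, eps, len(V))  # states 2, 3, and 4
belief = uniform_belief(cands)

# Three ways to score the same observation (all hypothetical):
nominal = V[s_obs]                                      # trusts the observation
worst_case = min(V[s] for s in cands)                   # worst-case (maximin) view
belief_avg = sum(p * V[s] for s, p in belief.items())   # belief-weighted view
```

In this toy setting the worst-case score is dominated by a single unlikely bad state, while the nominal score ignores the perturbation entirely. A belief-weighted score sits between the two, which is the kind of trade-off the abstract describes.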

Paper Structure (21 sections, 4 theorems, 33 equations, 5 figures, 11 tables, 2 algorithms)


Introduction
Related work
Adversarial Counterfactual Error (ACoE)
Optimizing C-ACoE along with Non-adversarial Expected Reward in Adversarial RL
Experiments
Experiment setup
Results
Discussion and Limitations
Proofs and Additional Theory Results
Adaptation for DQN
Estimation of Belief for Continuous State Space
Defining ACoE Belief Methods with State Histories
Additional Experimental Results
Long-horizon Adversaries
Empirical Evaluations with Protected-PPO
...and 6 more sections

Key Result

Theorem 3.2

Let $K = \max_{s \in \mathcal{S}} V(s)$ and assume $\mathrm{TV}\big(T(\cdot \mid s_o, a),\, P_o(\cdot \mid b, a)\big) \leq \Xi$ for any observed state $s_o$, belief $b$, and action $a$ in the same time step, then
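The two quantities named in the theorem's assumptions, $K$ and the total variation (TV) distance, can be made concrete for a small discrete case. This is a minimal sketch: the three-state distributions, the values in `V`, and the bound `Xi` are made-up numbers, and the two arrays only stand in for $T(\cdot \mid s_o, a)$ and $P_o(\cdot \mid b, a)$.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Hypothetical 3-state example (all numbers are illustrative).
V = np.array([1.0, 4.0, 2.5])               # state values V(s)
K = V.max()                                 # K = max_s V(s)

T_next = np.array([0.7, 0.2, 0.1])          # stands in for T(. | s_o, a)
P_belief_next = np.array([0.5, 0.3, 0.2])   # stands in for P_o(. | b, a)

tv = total_variation(T_next, P_belief_next)
Xi = 0.25                                   # assumed upper bound on the TV distance
assumption_holds = tv <= Xi
```

For finite supports, the TV distance is half the L1 distance between the two probability vectors, so checking the assumption amounts to one comparison per (observed state, belief, action) triple.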

Figures (5)

Figure 1: A3B belief construction. Let the dotted line $\overline{s_is_j}$ have magnitude representing the damage when perturbing $s_i\rightarrow s_j$. In this example, our method should discount the possibility that $\nu(s_2)=s_0$, and lessen the score $z(s_2)$.
Figure 2: Robust agents vs. a Critical Point strategic adversary (Sun et al., 2020) with increasing search sizes.
Figure 3: Robust agents vs. a Strategically Timed Attack adversary (Lin et al., 2017), as the length of perturbation increases. We find that as the level of strategy increases from long-horizon attackers, C-ACoE minimization improves robust performance, relative to other methods.
Figure 4: Robust agents vs. a PA-AD attacker (Sun et al., 2023), as the optimality of the attacker policy increases. To represent levels of optimality, we save PA-AD model weights at 5 evenly distributed points across the training epochs. We find that as the level of strategy increases from long-horizon attackers, C-ACoE minimization achieves more robust performance, relative to other methods.
Figure 5: Last 5 frames of PPO, A3B, and WocaR agents (top to bottom), on MuJoCo-HalfCheetah. PPO deviates the least from the dashed center-mass line, and has the least balanced gait. WocaR has arguably the most stable posture when noting the faster front leg recovery of A3B, but our empirical results suggest optimizing maximum stability is not always necessary. Full GIFs: tinyurl.com/a3b-gif

Theorems & Definitions (9)

Definition 3.1: Cumulative Adversarial Counterfactual Error (C-ACoE)
Theorem 3.2
Proposition 4.1
Proof of Theorem 3.2
Theorem A.1
Proof of Theorem A.1
Proof of Proposition 4.1
Lemma C.1
proof
