Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

Xiangyu Liu; Chenghao Deng; Yanchao Sun; Yongyuan Liang; Furong Huang

TL;DR

This work addresses the robustness of reinforcement learning policies under state-adversarial attacks beyond worst-case scenarios. It introduces PROTECTED, a framework that pre-trains a finite set of non-dominated policies $\Tilde{\Pi}$ and runs online no-regret adaptation over this set at test time to minimize regret against adaptive attackers, instead of optimizing a single worst-case policy. The authors prove that sublinear regret is intrinsically hard to achieve with unrestricted policy classes and provide iterative, finite-policy discovery methods that guarantee near-optimality up to a gap $\delta$, along with practical optimization strategies. Empirical evaluation on MuJoCo tasks shows that PROTECTED improves natural performance while maintaining robustness across static and dynamic attacks, with efficient adaptation even for small policy sets. Overall, the approach offers a practical balance between robustness and test-time efficiency by combining training-time policy discovery with online adaptation in a finite policy space.

Abstract

In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or in the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness in achieving sublinear regret when the baseline policy is from a general continuous policy class, $\Pi$. This finding prompts us to \textit{refine} the baseline policy class $\Pi$ prior to test time, aiming for efficient adaptation within a finite policy class $\Tilde{\Pi}$, where adaptation can rely on an adversarial bandit subroutine. In light of the importance of a small, finite $\Tilde{\Pi}$, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal $\Tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on MuJoCo corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.
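
The abstract mentions that adaptation within a finite policy class can rely on an adversarial bandit subroutine. As a rough illustration only, the sketch below uses EXP3, a standard adversarial bandit algorithm, to pick one of a finite number of pre-trained policies per episode. The choice of EXP3, the names `exp3_select`, `exp3_update`, `run_exp3`, the callback `episode_return`, and the exploration rate `gamma` are all assumptions made for this example and are not taken from the paper; episode returns are assumed to be rescaled to $[0, 1]$.

```python
import math
import random


def exp3_select(weights, gamma, rng):
    """Sample a policy index from the EXP3 distribution; return (index, probs)."""
    k = len(weights)
    total = sum(weights)
    # Mix the weight-proportional distribution with uniform exploration.
    probs = [(1.0 - gamma) * w / total + gamma / k for w in weights]
    index = rng.choices(range(k), weights=probs)[0]
    return index, probs


def exp3_update(weights, probs, index, reward, gamma):
    """Importance-weighted exponential update; reward is assumed to lie in [0, 1]."""
    k = len(weights)
    estimate = reward / probs[index]  # unbiased estimate of the chosen arm's reward
    weights[index] *= math.exp(gamma * estimate / k)
    # Rescale by the largest weight to avoid overflow; probabilities are unchanged.
    largest = max(weights)
    for i in range(k):
        weights[i] /= largest


def run_exp3(episode_return, num_policies, num_episodes, gamma=0.1, seed=0):
    """Run EXP3 over a finite policy set and return how often each policy was played.

    episode_return(index, t) is a hypothetical callback that plays policy `index`
    in episode `t` (against whatever attacker is present) and returns a value in [0, 1].
    """
    rng = random.Random(seed)
    weights = [1.0] * num_policies
    counts = [0] * num_policies
    for t in range(num_episodes):
        index, probs = exp3_select(weights, gamma, rng)
        reward = episode_return(index, t)
        exp3_update(weights, probs, index, reward, gamma)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Toy static attacker: policy 2 is the best response, the others do poorly.
    toy_return = lambda index, t: 0.9 if index == 2 else 0.2
    print(run_exp3(toy_return, num_policies=3, num_episodes=2000))
```

The per-round cost of this kind of subroutine scales with the number of policies, which is one reason a small finite policy set matters for test-time efficiency.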

Paper Structure (33 sections, 9 theorems, 17 equations, 9 figures, 3 tables, 2 algorithms)

Introduction
Related works
Preliminaries
The PROTECTED framework
Online adaptation for adaptive defenses
Pre-training for non-dominated policies via iterative discovery
Implications.
A practical algorithm.
How to attack adaptive victim policies optimally?
Experiments
Experimental setup and baselines
Performance against static attacks
Performance against dynamic attacks
On the scalability of $|\Tilde{\Pi}|$
Concluding remarks and limitations
...and 18 more sections

Key Result

Proposition 4.3

Fix $\alpha\in[0, 1)$. There does not exist an algorithm that produces a sequence of victim policies $\{\pi^t\}_{t\in [T]}$ such that $\mathbb{E}[\operatorname{Regret}(T)] = \operatorname{poly}(S, A, H) T^{\alpha}$ for any $\{v^t\}_{t\in [T]}$.

Figures (9)

Figure 1: Diagram of our PROTECTED framework. During training, we iteratively discover non-dominated policies, forming a finite policy class $\Tilde{\Pi}$. The blue area delineates the reward landscape for victims against attackers, denoted as $\{(J(\pi, \nu^1), J(\pi, \nu^2)) \,|\, \pi\in\Pi\}$. Here, only two attackers are visualized for clarity. The orange area, on the other hand, represents the space of policies that are "dominated" by the discovered policy class $\Tilde{\Pi}$. Dominated policies are those that are outperformed by at least one (mixed) policy in $\Tilde{\Pi}$ across the specified range of attackers. We refer to §\ref{sec:expl} for more detailed explanations. During test time, online adaptation mechanisms are activated to adaptively adjust the weight of each policy in response to various attack scenarios.
Figure 2: Online adaptation when facing unknown static attackers. The best policy is identified quickly and reliably, within $800$ episodes or fewer, against different attackers.
Figure 3: Time-averaged cumulative rewards during online adaptation against periodic and probabilistic switching attacks on Ant. The shaded area indicates that PA-AD attacks are active, while the unshaded area indicates no attacks.
Figure 4: Ablation study of $|\Tilde{\Pi}|$ against PA-AD attacks on $4$ environments.
Figure 5: Iterative discovery of non-dominated policies in two dimensions.
...and 4 more figures
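
The Figure 1 caption describes dominated policies as those outperformed by at least one (mixed) policy in $\Tilde{\Pi}$ across the attackers considered. A minimal sketch of that check for one given mixture is shown below. The helper names `mixture_payoff` and `is_dominated_by`, the payoff-table layout, and the use of a weak inequality with tolerance `tol` are assumptions for illustration; the paper's formal definition (Definition 4.9) is not reproduced here, and searching over all mixtures would require a linear program that this sketch omits.

```python
def mixture_payoff(payoffs, mix):
    """Expected return of a mixture over policies, one value per attacker.

    payoffs[i][j] is the return of policy i against attacker j (hypothetical table);
    mix[i] is the probability of playing policy i.
    """
    num_attackers = len(payoffs[0])
    return [
        sum(weight * row[j] for weight, row in zip(mix, payoffs))
        for j in range(num_attackers)
    ]


def is_dominated_by(candidate, payoffs, mix, tol=1e-9):
    """Return True if the given mixture does at least as well as `candidate`
    against every attacker (weak dominance, an assumption of this sketch)."""
    mixed = mixture_payoff(payoffs, mix)
    return all(m >= c - tol for m, c in zip(mixed, candidate))


if __name__ == "__main__":
    # Two discovered policies, each specialized against one of two attackers.
    discovered = [[1.0, 0.0], [0.0, 1.0]]
    # Neither pure policy covers (0.4, 0.4), but the uniform mixture does.
    print(is_dominated_by([0.4, 0.4], discovered, [0.5, 0.5]))
    print(is_dominated_by([0.6, 0.1], discovered, [0.5, 0.5]))
```

The toy table shows why mixtures matter: a candidate can escape domination by every single policy in the set and still be dominated by a mixture of them.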

Theorems & Definitions (14)

Definition 4.1: Exploitability
Definition 4.2: Regret
Proposition 4.3
Remark 4.4
Proposition 4.5: bubeck2012regret
Definition 4.6
Proposition 4.7
Proposition 4.8
Definition 4.9: Dominated and Non-dominated Policy
Theorem 4.10
...and 4 more
