Robust off-policy Reinforcement Learning via Soft Constrained Adversary

Kosuke Nakanishi; Akihiro Kubo; Yuji Yasui; Shin Ishii


TL;DR

This work addresses robust reinforcement learning under long-horizon adversaries by moving beyond traditional $L_p$-norm perturbations and introducing an $f$-divergence constrained attack framework that leverages prior knowledge of perturbation distributions. It derives two attack schemes, SofA and EpsA, and instantiates them as two off-policy SAC-based methods, SofA-SAC and EpsA-SAC, with theoretical contraction and policy-improvement guarantees. Empirical results on four MuJoCo tasks show that SofA-SAC and EpsA-SAC are robust to Gaussian perturbations and strong $L_{\infty}$ attacks while maintaining competitive or superior performance relative to SA-SAC and other baselines. The proposed framework provides a flexible, sample-efficient approach to robust RL that can incorporate realistic noise models and extend to domain-randomization-type settings, albeit at a higher computational cost incurred by the adversarial computations.

Abstract

Recently, reinforcement learning (RL) methods that are robust to perturbations of input observations have attracted significant attention and evolved rapidly, motivated by RL's potential vulnerability. Although these methods have achieved reasonable success, they have two limitations when the adversary is considered over long-term horizons. First, the mutual dependency between the policy and its corresponding optimal adversary limits the development of off-policy RL algorithms: because the optimal adversary should depend on the current policy, applications to off-policy RL have been restricted. Second, these methods generally assume perturbations based only on the $L_p$-norm, even when prior knowledge of the perturbation distribution in the environment is available. We introduce another perspective on adversarial RL: an $f$-divergence constrained problem that uses the prior knowledge distribution. From this, we derive two typical attacks and their corresponding robust learning frameworks. Robustness evaluations demonstrate that the proposed methods achieve excellent performance in sample-efficient off-policy RL.
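
To make the idea of a divergence-constrained adversary concrete, here is a minimal sketch of the standard KL-regularized case. It is an illustration under stated assumptions, not the authors' algorithm. Minimizing $\mathbb{E}_{\nu}[Q] + \frac{1}{\alpha}\mathrm{KL}(\nu \,\|\, p)$ over perturbation distributions $\nu$ gives $\nu^{*} \propto p \exp(-\alpha Q)$, which can be approximated by drawing candidates from the prior $p$ and resampling them with softmax weights. The Gaussian prior, the scoring function `q_fn`, and all function and parameter names are hypothetical.

```python
import numpy as np


def soft_worst_weights(q_values, alpha_attk):
    """Self-normalized weights proportional to exp(-alpha_attk * Q).

    Candidates are assumed to be drawn from the prior already, so the
    prior density cancels and only the exponential tilt remains.
    """
    logits = -alpha_attk * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def sample_soft_worst(state, q_fn, sigma, n_samples, alpha_attk, rng):
    """Pick one perturbed observation, favoring low-value candidates.

    q_fn is a hypothetical scalar score of how well the agent does when
    it observes a given candidate; lower means worse for the agent.
    """
    # Assumption: isotropic Gaussian prior centered on the true state.
    noise = rng.standard_normal((n_samples, state.shape[0]))
    candidates = state + sigma * noise
    q_values = np.array([q_fn(s) for s in candidates])
    weights = soft_worst_weights(q_values, alpha_attk)
    idx = rng.choice(n_samples, p=weights)
    return candidates[idx], weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs, w = sample_soft_worst(
        state=np.zeros(3),
        q_fn=lambda s: float(s.sum()),  # toy score, for illustration only
        sigma=0.1,
        n_samples=64,
        alpha_attk=4.0,
        rng=rng,
    )
    print(obs.shape, w.sum())
```

In this sketch, `alpha_attk = 0` recovers plain sampling from the prior, while a very large `alpha_attk` concentrates the weight on the lowest-scoring candidate, so a single parameter interpolates between random noise and a worst-case attack among the sampled candidates.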


Paper Structure (75 sections, 3 theorems, 62 equations, 12 figures, 4 tables, 4 algorithms)


Introduction
Related Work
Adversarial Attack and Defense on State Observations
Preliminaries and Background
Notations
Max-Entropy Off-Policy Actor Critic Algorithm
Reinforcement Learning under Adversarial Attack on State Observation
Methodology
Soft Constrained Representation of Adversarial Attack on State Observation
Soft Worst Attack (SofA) Sampling Method for the KL-divergence Constraint
Epsilon Worst Approximation Attack (EpsA) for the $\alpha$-divergence Constraint
Robust off-policy Reinforcement Learning via Soft Constrained Adversary
Soft Worst Max-Entropy Reinforcement Learning (SofA-SAC)
Epsilon Worst Max-Entropy Reinforcement Learning (EpsA-SAC)
Experiments
...and 60 more sections

Key Result

Theorem 1

The Soft Worst Bellman Operator $\underline{\mathcal{T}}^{\pi}_{soft}$ acts as a contraction operator for a fixed policy.
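
For reference, a contraction property of this kind is conventionally written as below. The sup-norm and the modulus $\gamma$ (taken here to be the discount factor) are assumptions; this summary does not state which norm or constant the paper uses.

$$
\left\| \underline{\mathcal{T}}^{\pi}_{soft} Q_1 - \underline{\mathcal{T}}^{\pi}_{soft} Q_2 \right\|_{\infty} \le \gamma \left\| Q_1 - Q_2 \right\|_{\infty}, \qquad 0 \le \gamma < 1 .
$$

Under an inequality of this form, the Banach fixed-point theorem implies that the operator has a unique fixed point and that repeated application converges to it from any bounded initial $Q$.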

Figures (12)

Figure 1: Robustness evaluation results of SofA-SAC and baseline algorithms under Gaussian-based attacks. Each boxplot depicts the 25th, 50th, and 75th percentile values of the mean returns.
Figure 2: Robustness evaluation results of EpsA-SAC and baseline algorithms under $L_{\infty}$-norm attacks. Each boxplot depicts the 25th, 50th, and 75th percentile values of the mean returns.
Figure 3: Ablation results for SofA-SAC's hyperparameters. We vary the number of samples from $N=64$ down to $1$ and the worst-preference parameter from $\alpha_{attk}=4$ to $2048$.
Figure 4: Ablation results for EpsA-SAC's training strategy. We omit the adversarial perturbation during policy improvement and Q-function updates in training.
Figure 5: Learning curves for SofA-SAC and baseline algorithms on four MuJoCo control tasks. The solid lines represent the average evaluation scores, and the shaded areas indicate the standard deviation.
...and 7 more figures

Theorems & Definitions (4)

Definition 1: Soft Constrained Optimal Adversary on State Observation
Theorem 1
Theorem 2: Policy Improvement Theorem with a Fixed Adversary
Theorem 3
