Robust off-policy Reinforcement Learning via Soft Constrained Adversary

Kosuke Nakanishi, Akihiro Kubo, Yuji Yasui, Shin Ishii

TL;DR

This work addresses robust reinforcement learning under long-horizon adversaries by moving beyond traditional $L_p$-norm perturbations and introducing an $f$-divergence constrained attack framework that leverages prior knowledge of perturbation distributions. It derives two attack schemes, SofA and EpsA, and instantiates them into two off-policy SAC-based methods, SofA-SAC and EpsA-SAC, with theoretical contraction and policy-improvement guarantees. Empirical results on four MuJoCo tasks show that SofA-SAC and EpsA-SAC achieve strong robustness to Gaussian and strong $L_{\infty}$ perturbations while maintaining competitive or superior performance relative to SA-SAC and other baselines. The proposed framework provides a flexible, sample-efficient approach to robust RL that can incorporate realistic noise models and extend to domain-randomization-type settings, albeit with higher computational cost due to adversarial considerations.

Abstract

Recently, reinforcement learning (RL) methods that are robust to perturbations of input observations have attracted significant attention and evolved rapidly, owing to RL's potential vulnerability. Although these methods have achieved reasonable success, they have two limitations when the adversary is considered over a long-term horizon. First, the mutual dependency between the policy and its corresponding optimal adversary hinders the development of off-policy RL algorithms: because the optimal adversary depends on the current policy, existing approaches have been difficult to apply to off-policy RL. Second, these methods generally assume perturbations bounded only by the $L_p$-norm, even when prior knowledge of the perturbation distribution in the environment is available. We introduce another perspective on adversarial RL: an $f$-divergence constrained problem that uses a prior distribution encoding this knowledge. From this, we derive two typical attacks and their corresponding robust learning frameworks. Robustness evaluations demonstrate that the proposed methods achieve excellent performance in sample-efficient off-policy RL.
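To make the idea of a divergence-constrained adversary with a prior more concrete, here is a minimal sketch under stated assumptions. It uses the KL divergence as one instance of an $f$-divergence, for which minimizing $\mathbb{E}_q[Q] + \frac{1}{\alpha}\mathrm{KL}(q \,\|\, p)$ over $q$ gives $q \propto p \exp(-\alpha Q)$. With candidates sampled from the prior $p$, this reduces to softmin weights over the samples. The names `soft_adversary_weights`, `toy_q`, `n_samples`, `sigma`, and `alpha` are hypothetical and are not claimed to match the paper's definitions or hyperparameters.

```python
import numpy as np

def soft_adversary_weights(q_values, alpha):
    """Self-normalised weights of a KL-regularised adversary over prior samples.

    Candidates are assumed to be drawn from the prior p, so the solution
    q ∝ p * exp(-alpha * Q) becomes softmax(-alpha * Q) over the samples.
    """
    z = -alpha * np.asarray(q_values, dtype=float)
    z = z - z.max()  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def toy_q(obs):
    # Hypothetical stand-in for Q(s, pi(obs)): value drops as the
    # perturbed observation drifts away from the origin.
    return -np.sum(obs ** 2, axis=-1)

rng = np.random.default_rng(0)
state = np.zeros(3)
n_samples, sigma = 64, 0.1  # illustrative Gaussian prior over perturbations
candidates = state + sigma * rng.normal(size=(n_samples, 3))
q = toy_q(candidates)

for alpha in (0.0, 10.0, 1000.0):
    w = soft_adversary_weights(q, alpha)
    # alpha = 0 recovers the prior average; larger alpha leans toward the worst sample.
    print(alpha, float(w @ q))
```

In this sketch, `alpha` interpolates between sampling perturbations from the prior and picking the worst candidate, which is the kind of trade-off a soft constraint allows and a hard $L_p$-ball does not.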

Paper Structure (75 sections, 3 theorems, 62 equations, 12 figures, 4 tables, 4 algorithms)

Key Result

Theorem 1

The Soft Worst Bellman Operator $\underline{\mathcal{T}}^{\pi}_{soft}$ acts as a contraction operator for a fixed policy.
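The operator's exact definition is not reproduced in this summary, so the following is only an illustrative numerical check on a small tabular MDP, assuming a KL-regularized soft minimum over perturbed observations in log-sum-exp form. That form is a non-expansion in the sup norm, so the resulting backup is a $\gamma$-contraction. The function `soft_worst_backup` and the arrays `R`, `P`, `pi`, and `prior` are hypothetical constructions for this sketch.

```python
import numpy as np

def soft_worst_backup(Q, R, P, pi, prior, alpha, gamma):
    """One application of an illustrative soft worst-case Bellman backup.

    Q:     (S, A) action values
    R:     (S, A) rewards
    P:     (S, A, S) transition probabilities
    pi:    (S, A) fixed policy, indexed by the observed state
    prior: (S, S) prior[s, o] = prior probability of observing o in true state s
    """
    # q_obs[s, o] = sum_a pi[o, a] * Q[s, a]: value at s when acting on observation o.
    q_obs = Q @ pi.T
    # Soft minimum over observations: -(1/alpha) * log sum_o prior[s, o] * exp(-alpha * q_obs[s, o]).
    z = -alpha * q_obs
    z_max = z.max(axis=1, keepdims=True)
    V = -(z_max[:, 0] + np.log((prior * np.exp(z - z_max)).sum(axis=1))) / alpha
    return R + gamma * (P @ V)

rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 5.0
R = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
pi = rng.dirichlet(np.ones(A), size=S)       # shape (S, A)
prior = rng.dirichlet(np.ones(S), size=S)    # shape (S, S), strictly positive rows

Q1, Q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
gap_before = np.abs(Q1 - Q2).max()
gap_after = np.abs(
    soft_worst_backup(Q1, R, P, pi, prior, alpha, gamma)
    - soft_worst_backup(Q2, R, P, pi, prior, alpha, gamma)
).max()
print(gap_after <= gamma * gap_before)
```

The log-sum-exp form is chosen here because its non-expansion property is easy to verify; a Boltzmann-weighted average of Q-values does not share that guarantee in general.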

Figures (12)

  • Figure 1: Robustness evaluation results of SofA-SAC and baseline algorithms under Gaussian-based attacks. Each boxplot depicts the 25th, 50th, and 75th percentiles of the mean returns.
  • Figure 2: Robustness evaluation results of EpsA-SAC and baseline algorithms under $L_{\infty}$-norm attacks. Each boxplot depicts the 25th, 50th, and 75th percentiles of the mean returns.
  • Figure 3: Ablation results for SofA-SAC's hyperparameters. We vary the number of samples from $N=64$ to $1$ and the worst-preference parameter from $\alpha_{attk}=4$ to $2048$.
  • Figure 4: Ablation results for EpsA-SAC's training strategy. We omit the adversarial perturbation during policy improvement and Q-function updates in training.
  • Figure 5: Learning curves for SofA-SAC and baseline algorithms on four MuJoCo control tasks. The solid lines represent the average evaluation scores, and the shaded areas indicate the standard deviation.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Definition 1: Soft Constrained Optimal Adversary on State Observation
  • Theorem 1
  • Theorem 2: Policy Improvement Theorem with a Fixed Adversary
  • Theorem 3