Robust off-policy Reinforcement Learning via Soft Constrained Adversary
Kosuke Nakanishi, Akihiro Kubo, Yuji Yasui, Shin Ishii
TL;DR
This work addresses robust reinforcement learning under long-horizon adversaries by moving beyond traditional $L_p$-norm perturbations and introducing an $f$-divergence constrained attack framework that leverages prior knowledge of perturbation distributions. It derives two attack schemes, SofA and EpsA, and instantiates them as two off-policy SAC-based methods, SofA-SAC and EpsA-SAC, with theoretical contraction and policy-improvement guarantees. Empirical results on four MuJoCo tasks show that SofA-SAC and EpsA-SAC are robust to both Gaussian and strong $L_{\infty}$ perturbations while maintaining competitive or superior performance relative to SA-SAC and other baselines. The proposed framework provides a flexible, sample-efficient approach to robust RL that can incorporate realistic noise models and extend to domain-randomization-type settings, albeit at a higher computational cost because of the adversarial component.
Abstract
Robust reinforcement learning (RL) methods that defend against perturbations of input observations have recently attracted significant attention and evolved rapidly, motivated by RL's potential vulnerability to such perturbations. Although these methods have achieved reasonable success, they have two limitations when the adversary is considered over long horizons. First, the mutual dependency between the policy and its corresponding optimal adversary limits the development of off-policy RL algorithms: because the optimal adversary depends on the current policy, applications to off-policy RL have been restricted. Second, these methods generally assume perturbations constrained only by the $L_p$-norm, even when prior knowledge of the perturbation distribution in the environment is available. We introduce another perspective on adversarial RL: an $f$-divergence constrained problem with a prior knowledge distribution. From this, we derive two typical attacks and their corresponding robust learning frameworks. Robustness evaluations demonstrate that our proposed methods achieve excellent performance in sample-efficient off-policy RL.
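As an illustration of what a divergence-constrained adversary can look like, the sketch below uses the KL divergence, one common member of the $f$-divergence family. It is not taken from the paper and is not claimed to match the exact SofA or EpsA formulation. It assumes a finite set of candidate perturbations with prior probabilities `prior` and a value estimate `values[i]` for each candidate. The function name `soft_worst_case` and the temperature `alpha` are hypothetical. For the penalized problem $\min_q \mathbb{E}_q[V] + \alpha\,\mathrm{KL}(q \,\|\, p)$, the standard closed-form minimizer is $q_i \propto p_i \exp(-V_i/\alpha)$.

```python
import math


def soft_worst_case(prior, values, alpha):
    """Illustrative KL-penalized adversary over a finite candidate set.

    Solves min_q E_q[values] + alpha * KL(q || prior) in closed form.
    Returns the adversarial distribution q and the attained soft-minimum value.
    All names here are assumptions for illustration, not the paper's API.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Unnormalized log-weights: log p_i - V_i / alpha.
    # Candidates with zero prior mass stay at zero mass.
    logits = [
        math.log(p) - v / alpha if p > 0 else float("-inf")
        for p, v in zip(prior, values)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    q = [w / z for w in weights]
    # Optimal objective value: -alpha * log sum_i p_i * exp(-V_i / alpha).
    soft_min = -alpha * (m + math.log(z))
    return q, soft_min


if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]   # prior over four candidate perturbations
    values = [1.0, 2.0, 3.0, 0.5]      # value of the policy under each candidate
    for alpha in (0.01, 1.0, 1e6):
        q, v = soft_worst_case(prior, values, alpha)
        print(alpha, [round(x, 3) for x in q], round(v, 3))
```

In this sketch a small `alpha` concentrates the adversary on the lowest-value perturbation (a hard worst case), while a large `alpha` keeps it close to the prior, so the resulting value interpolates between the minimum and the prior-weighted mean of `values`.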
