Toward Evaluating Robustness of Reinforcement Learning with Adversarial Policy

Xiang Zheng; Xingjun Ma; Shengjie Wang; Xinyu Wang; Chao Shen; Cong Wang

TL;DR

This work addresses the vulnerability of reinforcement learning agents to test-time evasion attacks by learning adversarial policies in a black-box setting. It introduces Intrinsically Motivated Adversarial Policy (IMAP), a regularizer-based framework that formulates four adversarial intrinsic regularizers (state coverage, policy coverage, risk, and divergence) and an adaptive bias-reduction mechanism that balances the extrinsic attack objective against the intrinsic regularizer. Using a Frank-Wolfe optimization approach within a PPO-based policy update, IMAP achieves strong attacking performance across dense- and sparse-reward single-agent tasks and competitive multi-agent games, often outperforming state-of-the-art baselines and defeating defenses such as adversarial training and robust regularizers. The results show that principled intrinsic motivation helps red-team robust RL systems, and they give practical guidance on choosing a regularizer and an adaptive balancing strategy. The work is a step toward systematic black-box robustness evaluation and suggests directions for defense-oriented research and for red-teaming in RL and related domains.

Abstract

Reinforcement learning agents are susceptible to evasion attacks during deployment. In single-agent environments, these attacks can occur through imperceptible perturbations injected into the inputs of the victim policy network. In multi-agent environments, an attacker can manipulate an adversarial opponent to influence the victim policy's observations indirectly. While adversarial policies offer a promising technique to craft such attacks, current methods are either sample-inefficient due to poor exploration strategies or require extra surrogate model training under the black-box assumption. To address these challenges, in this paper, we propose Intrinsically Motivated Adversarial Policy (IMAP) for efficient black-box adversarial policy learning in both single- and multi-agent environments. We formulate four types of adversarial intrinsic regularizers (maximizing the adversarial state coverage, policy coverage, risk, or divergence) to discover potential vulnerabilities of the victim policy in a principled way. We also present a novel bias-reduction method to balance the extrinsic objective and the adversarial intrinsic regularizers adaptively. Our experiments validate the effectiveness of the four types of adversarial intrinsic regularizers and the bias-reduction method in enhancing black-box adversarial policy learning across a variety of environments. Our IMAP successfully evades two types of defense methods, adversarial training and robust regularizer, decreasing the performance of the state-of-the-art robust WocaR-PPO agents by 34%-54% across four single-agent tasks. IMAP also achieves a state-of-the-art attacking success rate of 83.91% in the multi-agent game YouShallNotPass. Our code is available at https://github.com/x-zheng16/IMAP.
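As a rough illustration of what a state-coverage intrinsic regularizer could look like, the sketch below shapes an adversary's reward with a k-nearest-neighbour novelty bonus. This is not the paper's formulation: the function names `knn_coverage_bonus` and `adversary_rewards`, the `log(1 + distance)` bonus, the negated-victim-reward extrinsic term, and the fixed weight `tau` are all assumptions made for illustration. In particular, IMAP balances the two terms adaptively, whereas this sketch uses a constant weight.

```python
import math


def knn_coverage_bonus(states, k=1):
    """Hypothetical coverage bonus: log(1 + distance to the k-th nearest neighbour).

    States that lie far from the rest of the batch receive a larger bonus,
    which nudges the adversary toward rarely visited regions.
    Requires len(states) > k.
    """
    bonuses = []
    for i, s in enumerate(states):
        # Distances from state i to every other state in the batch, ascending.
        dists = sorted(
            math.dist(s, other) for j, other in enumerate(states) if j != i
        )
        bonuses.append(math.log1p(dists[k - 1]))
    return bonuses


def adversary_rewards(victim_rewards, states, tau=0.1, k=1):
    """Assumed shaping: negated victim reward plus tau times the coverage bonus."""
    bonus = knn_coverage_bonus(states, k)
    return [-r + tau * b for r, b in zip(victim_rewards, bonus)]


# Toy batch: two nearby states and one isolated state.
states = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
bonus = knn_coverage_bonus(states)
rewards = adversary_rewards([1.0, 1.0, 1.0], states, tau=0.5)
```

In the toy batch, the isolated state `(3.0, 4.0)` receives the largest bonus, so its shaped reward is the highest of the three even though the victim reward is identical everywhere.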

Paper Structure (49 sections, 17 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Background
Motivations and Design Rationale
Motivations
Design Rationale
Preliminaries
Single-Agent RL Tasks
Multi-Agent RL Tasks
Policy Optimization
Threat Model
Objective of the Adversary
Knowledge of the Adversary
Capabilities of the Adversary
Single-Agent RL Tasks
Multi-Agent RL Tasks
...and 34 more sections

Figures (7)

Figure 1: The robust victim agent, trained with the state-of-the-art defense method WocaR (Liang et al., 2022), is attacked by (top) the state-of-the-art AP method SA-RL and (bottom) our IMAP in the single-agent environment Walker. Though the WocaR Walker learned to lower its body to be robust, our IMAP can find its vulnerable states and successfully lure the victim to lean forward and fall.
Figure 2: The victim (in blue) is attacked by an adversarial opponent (in red) in the multi-agent environment YouShallNotPass. The adversary is trained via (top) AP-MARL or (bottom) IMAP. AP-MARL learns to collapse statically on the ground and fails to block the victim. In contrast, our IMAP learns a stronger adversarial skill to intercept the victim.
Figure 3: Rendered pictures of typical MuJoCo environments. (a) the locomotion environment Ant; (b) the navigation environment AntUMaze where the red point is the goal position; (c) the manipulation environment FetchReach where the red point is the goal position; (d) the two-player zero-sum competitive game YouShallNotPass where the blue human is the victim and the red is the adversary.
Figure 4: Curves of test-time attacking results of SA-RL and four types of IMAP attacks on six sparse-reward locomotion tasks. IMAP-R significantly outperforms SA-RL in SparseHopper and SparseWalker2d; IMAP-PC significantly surpasses SA-RL in SparseHalfCheetah and SparseHumanoidStandup.
Figure 5: Learning curves of AP-MARL and IMAP-PC+BR in two-player zero-sum competitive games. IMAP-PC+BR outperforms AP-MARL by a large margin.
...and 2 more figures
