Rethinking Adversarial Attacks in Reinforcement Learning from Policy Distribution Perspective

Tianyang Duan; Zongyuan Zhang; Zheng Lin; Yue Gao; Ling Xiong; Yong Cui; Hongbin Liang; Xianhao Chen; Heming Cui; Dong Huang

TL;DR

DRL agents are vulnerable to observation perturbations, but existing attacks target single sampled actions and can miss vulnerabilities in the full policy distribution, especially in continuous action spaces. The authors propose Distribution-Aware Projected Gradient Descent (DAPGD), which perturbs states under an $L_p$ constraint to maximize a Bhattacharyya-distance-based loss between the clean and perturbed policy distributions, so the attack shifts the entire policy distribution rather than individual actions. They formalize the attack within a policy-distribution framework and derive a gradient-based update rule, then report stronger attack performance than seven baselines on three Safety Gym navigation tasks, against both benign and defended models. The results suggest that distribution-aware perturbations give a more realistic assessment of DRL robustness, which matters for safety-critical AI systems.
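
For reference, the Bhattacharyya distance has a standard definition, and a sign-gradient projected update can be written generically as below. This is a sketch under assumptions: the symbols $\pi$, $s$, $s^{k}$, $\alpha$, $\epsilon$ and the $L_\infty$ form of the step are chosen here for illustration, and the paper's own equations may differ in notation and detail.

$$
D_B(p, q) = -\ln \int \sqrt{p(a)\, q(a)}\, \mathrm{d}a
$$

$$
s^{k+1} = \Pi_{\mathcal{B}_\epsilon(s)}\!\left( s^{k} + \alpha \cdot \operatorname{sign}\!\left( \nabla_{s^{k}}\, D_B\big(\pi(\cdot \mid s),\ \pi(\cdot \mid s^{k})\big) \right) \right)
$$

Here $\Pi_{\mathcal{B}_\epsilon(s)}$ projects the perturbed state back onto the ball of radius $\epsilon$ around the clean state $s$. Because $D_B$ compares whole distributions, the step does not depend on which action happened to be sampled.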

Abstract

Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies in the observation signal in real-world applications. Adversarial attacks are an effective way to evaluate the robustness of DRL agents. However, existing attack methods that target individual sampled actions have limited impact on the overall policy distribution, particularly in continuous action spaces. To address these limitations, we propose the Distribution-Aware Projected Gradient Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient perturbation input to attack the policy network, leveraging the entire policy distribution rather than individual samples. We use the Bhattacharyya distance in DAPGD to measure policy similarity, enabling sensitive detection of subtle but critical differences between probability distributions. Our experimental results demonstrate that DAPGD achieves state-of-the-art results on three robot navigation tasks, with an average reward drop 22.03% higher than that of the best baseline.
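
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a diagonal Gaussian policy, an $L_\infty$ budget, a random start inside the budget, and finite-difference gradients in place of backpropagation. `toy_policy`, `distribution_aware_pgd_sketch`, and all parameter values are hypothetical names and numbers introduced for illustration only.

```python
import math
import random


def bhattacharyya_diag_gauss(mu1, std1, mu2, std2):
    """Closed-form Bhattacharyya distance between two diagonal Gaussians."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mu1, std1, mu2, std2):
        var_avg = 0.5 * (s1 * s1 + s2 * s2)
        total += 0.125 * (m1 - m2) ** 2 / var_avg      # mean-shift term
        total += 0.5 * math.log(var_avg / (s1 * s2))   # variance-mismatch term
    return total


def toy_policy(state):
    """Hypothetical stand-in for a policy network: state -> (mean, std)."""
    mean = [0.8 * state[0] - 0.3 * state[1], 0.5 * state[1]]
    std = [0.2 + 0.1 * abs(state[0]), 0.3]
    return mean, std


def sign(x):
    return (x > 0) - (x < 0)


def distribution_aware_pgd_sketch(policy, state, eps=0.1, alpha=0.02,
                                  steps=10, h=1e-5, seed=0):
    """Ascend the distance between clean and perturbed policy distributions."""
    mu0, std0 = policy(state)

    def loss(s):
        mu, std = policy(s)
        return bhattacharyya_diag_gauss(mu0, std0, mu, std)

    # Random start inside the budget: the distance is minimal (zero) at the
    # clean state, so its gradient there carries no useful direction.
    rng = random.Random(seed)
    adv = [s + rng.uniform(-eps, eps) for s in state]

    for _ in range(steps):
        grad = []
        for i in range(len(adv)):
            plus, minus = adv[:], adv[:]
            plus[i] += h
            minus[i] -= h
            # Central finite difference (assumption: replaces autograd).
            grad.append((loss(plus) - loss(minus)) / (2 * h))
        # Sign-gradient ascent step, then projection onto the L_inf ball.
        adv = [a + alpha * sign(g) for a, g in zip(adv, grad)]
        adv = [min(max(a, s - eps), s + eps) for a, s in zip(adv, state)]
    return adv


clean_state = [1.0, -0.5]
adv_state = distribution_aware_pgd_sketch(toy_policy, clean_state)
attack_distance = bhattacharyya_diag_gauss(*toy_policy(clean_state),
                                           *toy_policy(adv_state))
print("perturbed state:", adv_state)
print("Bhattacharyya distance to clean policy:", attack_distance)
```

The loss here depends only on the policy's mean and standard deviation, so no action is ever sampled. A sampled-action attack would instead compare a single drawn action against the perturbed policy's output, which is the contrast the abstract draws.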

Paper Structure (9 sections, 5 equations, 2 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Methodology
Problem Formalization
Distribution Similarity Projected Gradient Descent
Evaluation
Experimental Setup
Overall performance of DAPGD
Conclusion

Figures (2)

Figure 1: Two methods for generating adversarial examples in the Goal task, in which the agent must navigate around Hazards and reach the Goal. Top: Existing methods (e.g., PGD) sample an action from the policy and attack using the sign of the gradient of a mean squared error loss. Bottom: Our method (DAPGD) uses policy distribution similarity directly, attacking with the sign of the gradient of the Bhattacharyya distance between policies.
Figure 2: Average reward obtained by the agent under each attack configuration in Button. Lower rewards indicate more effective attacks.
