Diffusion Guided Adversarial State Perturbations in Reinforcement Learning

Xiaolin Sun, Feidi Liu, Zhengming Ding, ZiZhan Zheng

TL;DR

This work addresses the vulnerability of reinforcement learning agents to semantic perturbations in vision-based inputs by introducing SHIFT, a diffusion-guided, policy-agnostic attack. SHIFT leverages classifier-free guidance, policy guidance via the victim's $Q^\pi$, and autoencoder realism to produce states that semantically differ from true states yet remain realistic and history-aligned, thereby evading diffusion-based defenses. Across a range of environments and defenses, SHIFT markedly degrades cumulative rewards while maintaining stealth, highlighting a trilemma between semantic change, historical alignment, and trajectory faithfulness. The results emphasize the need for robust RL policies that can withstand semantics-aware perturbations in real-world deployments.

Abstract

Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is especially a concern in vision-based environments where minor manipulations of high-dimensional image inputs can easily mislead the agent's behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. However, after closer investigation, we found that the effectiveness of the current defenses is due to a fundamental weakness of the existing $l_p$ norm-constrained attacks, which can barely alter the semantics of image input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel policy-agnostic diffusion-based state perturbation attack to go beyond this limitation. Our attack is able to generate perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, significantly outperforming existing attacks while being more perceptually stealthy. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, indicating the importance of developing more robust policies.
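
To make the limitation of $l_p$ norm-constrained attacks concrete, the following is a minimal sketch of a projected gradient descent (PGD) attack under an $l_\infty$ budget. The budget of $\frac{15}{255}$ matches the one quoted in the figure captions; the step size, the iteration count, and the linear toy victim `W` are assumptions made only for illustration and are not taken from the paper. The point is that every pixel stays within the budget of its true value, which is why such attacks struggle to change what the image depicts.

```python
import numpy as np

def pgd_linf(state, grad_fn, eps=15 / 255, step=2 / 255, iters=10):
    """Signed-gradient ascent on a loss, projected onto an l_inf ball around `state`."""
    adv = state.copy()
    for _ in range(iters):
        adv = adv + step * np.sign(grad_fn(adv))       # ascent step on the attack loss
        adv = np.clip(adv, state - eps, state + eps)   # project back into the l_inf ball
        adv = np.clip(adv, 0.0, 1.0)                   # keep pixels in the valid range
    return adv

# Hypothetical toy victim: linear Q-values over a flattened 84x84 frame, 3 actions.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 84 * 84))
state = rng.uniform(size=84 * 84)
best = int(np.argmax(W @ state))

# Attack loss: negative Q-value of the originally greedy action; its gradient is -W[best].
adv = pgd_linf(state, lambda s: -W[best])
max_change = float(np.max(np.abs(adv - state)))  # bounded by eps by construction
```

However the loss is chosen, the projection step caps the per-pixel change at `eps`, so objects in the scene cannot be added or removed.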

Paper Structure

This paper contains 43 sections, 1 theorem, 24 equations, 7 figures, 11 tables, 2 algorithms.

Key Result

Theorem 3.6

The reverse process when sampling from a history-conditioned DDPM model guided by the victim's state-action value function $Q^\pi$ is given by [displayed equation missing from this extract], where $\mu_i$ is derived from $\epsilon_i$ in the classifier-free predictor equation (classifier_free_predictor), as given by the equation for the mean (eq:mu), and $\sigma_i^2$ is determined by the variance schedule $\beta_i$.
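
Because the theorem's displayed equation is not reproduced here, the sketch below does not restate that result. It shows the standard classifier-guidance pattern that the statement's ingredients suggest: a classifier-free noise estimate, the usual DDPM mean computed from it, and a mean shift scaled by $\sigma_i^2$ along a guidance gradient. The function name, the choice $\sigma_i^2 = \beta_i$, the blending form of the classifier-free estimate, and the generic `guide_grad` argument (standing in for a gradient derived from $Q^\pi$) are all assumptions.

```python
import numpy as np

def guided_reverse_step(x_i, i, eps_cond, eps_uncond, betas, guide_grad,
                        cfg_weight=1.0, guide_scale=1.0, rng=None):
    """One guided DDPM reverse step (illustrative, not the paper's exact update)."""
    if rng is None:
        rng = np.random.default_rng()
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    # Classifier-free estimate: blend conditional and unconditional noise predictions.
    eps = eps_uncond + cfg_weight * (eps_cond - eps_uncond)
    # Standard DDPM posterior mean computed from the noise estimate.
    mu = (x_i - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps) / np.sqrt(alphas[i])
    sigma2 = betas[i]  # assumed variance choice: sigma_i^2 = beta_i
    # Guidance: shift the mean along the supplied gradient, scaled by the variance.
    mu_guided = mu + guide_scale * sigma2 * guide_grad
    # No noise is added at the final step (i == 0).
    noise = rng.normal(size=x_i.shape) if i > 0 else np.zeros_like(x_i)
    return mu_guided + np.sqrt(sigma2) * noise

# Tiny deterministic check at the final step with zero predicted noise.
betas = np.linspace(1e-4, 0.02, 10)
x = np.ones((2, 2))
zero = np.zeros((2, 2))
plain = guided_reverse_step(x, 0, zero, zero, betas, zero)
shifted = guided_reverse_step(x, 0, zero, zero, betas, np.ones((2, 2)), guide_scale=2.0)
```

In this sketch the guidance only moves the mean of each reverse step, so the sample remains a draw from the diffusion model's learned distribution, nudged step by step toward the attacker's objective.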

Figures (7)

  • Figure 1: A car approaches a crosswalk with a pedestrian ahead. The safe, optimal action is to brake. SHIFT-O removes the pedestrian from the agent’s observation, while SHIFT-I creates an imaginary trajectory suggesting the car has already crossed. Both mislead the agent into moving forward, resulting in a collision.
  • Figure 2: Examples of true and perturbed states captured by the front camera of a vehicle in the AirSim driving simulator. a) is the true state. b) and c) are the perturbed states under the PGD attack with an $l_\infty$ budget of $\frac{15}{255}$ and under rotation [korkmaz2023adversarial] by 3 degrees counterclockwise, respectively. d) and e) are the perturbed states generated by our SHIFT-O and SHIFT-I attacks, respectively. Neither the PGD nor the rotation attack alters the decision-related semantics. SHIFT-O removes the pedestrians and bicycles at the crosswalk while staying aligned with the real history. SHIFT-I lures the driver into thinking that the car has already crossed the crosswalk, when in fact it has not done so in the real environment. Note that SHIFT-I is history-aligned with the observed trajectory but not with the real trajectory.
  • Figure 3: Ablation study results. a) shows the rolling average of the $l_2$ reconstruction error (from the autoencoder-based realism detector) of our generated perturbed states with and without the realism enhancement. b) shows the $l_2$ reconstruction error and the deviation rate under different policy guidance strengths with the SHIFT-O attack. c) shows the distance between perturbed and true states ($84 \times 84$ grayscale images). a) and b) use the vanilla DQN policy.
  • Figure 4: Pipelines of SHIFT's two stages. a) shows the training stage where the attacker uses clean data to train a history-conditioned diffusion model and an autoencoder-based anomaly detector. b) shows the testing stage where the attacker perturbs the true state through the reverse sampling process of the pre-trained conditional diffusion model guided by the gradient of the victim's policy and that of the autoencoder's reconstruction loss.
  • Figure 5: Extra examples of true and perturbed states in Atari Freeway and Doom. a) is the true state. b) and c) are the perturbed states under the PGD attack with an $l_\infty$ budget of $\frac{15}{255}$ and under rotation [korkmaz2023adversarial] by 3 degrees counterclockwise, respectively. d) and e) are the perturbed states generated by our SHIFT-O and SHIFT-I attacks, respectively. Neither the PGD nor the rotation attack alters the decision-related semantics.
  • ...and 2 more figures
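
The captions for Figures 3 and 4 refer to an autoencoder-based realism detector scored by $l_2$ reconstruction error and summarized as a rolling average. A minimal sketch of that scoring idea follows. It uses a linear (PCA-style) autoencoder as a stand-in, not the paper's autoencoder, and the helper names, latent size, synthetic data, and window length are all hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(clean, k):
    """Fit a rank-k linear autoencoder (PCA) on clean states, one state per row."""
    mean = clean.mean(axis=0)
    _, _, vt = np.linalg.svd(clean - mean, full_matrices=False)
    return mean, vt[:k]  # top-k principal directions act as encoder and decoder

def reconstruction_error(x, mean, basis):
    """l2 distance between each state and its encode-decode reconstruction."""
    recon = mean + (x - mean) @ basis.T @ basis
    return np.linalg.norm(x - recon, axis=-1)

def rolling_mean(values, window):
    """Rolling average over a sequence of per-step errors."""
    return np.convolve(values, np.ones(window) / window, mode="valid")

# Synthetic "clean" states lying on a 4-dimensional subspace of a 32-dimensional space.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 32))
mean, basis = fit_linear_autoencoder(clean, k=4)

clean_err = reconstruction_error(clean, mean, basis)
# An off-manifold state: a clean state plus noise in all 32 dimensions.
odd_err = reconstruction_error(clean[0] + rng.normal(size=32), mean, basis)
smoothed = rolling_mean(clean_err, window=10)
```

States that resemble the clean data reconstruct almost exactly, while the off-manifold state has a clearly larger error. A detector of this kind would flag a state whose error exceeds a chosen threshold.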

Theorems & Definitions (6)

  • Definition 3.1: Valid States
  • Definition 3.2: Realistic States
  • Definition 3.3: Semantics-Changing States
  • Definition 3.4: History-Aligned States
  • Definition 3.5: Trajectory Faithfulness
  • Theorem 3.6