Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations

Xiaolin Sun, Zizhan Zheng

TL;DR

This work proposes a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states and is further enhanced with belief state inference and diffusion-based state purification to reduce uncertainty.

Abstract

Reinforcement learning (RL) has achieved phenomenal success in various domains. However, its data-driven nature also introduces new vulnerabilities that can be exploited by malicious opponents. Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage. Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternatively train the agent's policy and the attacker's policy. However, the former does not provide sufficient protection against strong attacks, while the latter is computationally prohibitive for large environments. In this work, we propose a new robust RL algorithm for deriving a pessimistic policy to safeguard against an agent's uncertainty about true states. This approach is further enhanced with belief state inference and diffusion-based state purification to reduce uncertainty. Empirical results show that our approach obtains superb performance under strong attacks and has a comparable training overhead with regularization-based methods. Our code is available at https://github.com/SliencerX/Belief-enriched-robust-Q-learning.
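To make the idea of a pessimistic policy concrete, here is a minimal sketch of one common reading: when the agent is unsure which state it is in, it picks the action whose worst-case value over all candidate states is highest. The tabular `q_table`, the `belief` set of state indices, and the function name `pessimistic_action` are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def pessimistic_action(q_table, belief):
    """Return the action with the best worst-case Q-value over the belief set.

    q_table: array of shape (num_states, num_actions) (hypothetical tabular Q).
    belief:  set of state indices the agent considers possible.
    """
    candidates = sorted(belief)
    # For each action, take the lowest Q-value among the candidate states.
    worst_case = q_table[candidates].min(axis=0)
    # Choose the action that maximizes this worst-case value.
    return int(np.argmax(worst_case))

# Toy values: action 0 is best in state 0 but very bad in state 1.
q_table = np.array([
    [1.0, 0.5],
    [-2.0, 0.4],
    [0.9, 0.3],
])
greedy = pessimistic_action(q_table, {0})      # no uncertainty about the state
cautious = pessimistic_action(q_table, {0, 1})  # state 1 is also possible
```

With no uncertainty the choice reduces to the usual greedy action; once a risky state enters the belief set, the safer action wins. Shrinking the belief set is therefore what makes the policy less conservative.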

Paper Structure (36 sections, 5 theorems, 23 equations, 12 figures, 7 tables)

This paper contains 36 sections, 5 theorems, 23 equations, 12 figures, 7 tables.

Key Result

Theorem 1

The gap between $Q^{\tilde{\pi}_n}$ and $Q^*$ is bounded in terms of $\Delta$, where $\tilde{\pi}_n$ is obtained by the Q-iteration algorithm and $\Delta = 2\epsilon\gamma(l_r+l_p|S|\frac{R_{max}}{1-\gamma})$.
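As a rough worked example, the expression for $\Delta$ can be evaluated directly. The sketch below only transcribes the formula above. The numeric values are made up for illustration, and reading $l_r$ and $l_p$ as constants tied to the reward and transition functions is an assumption, since this excerpt does not define them.

```python
def delta(epsilon, gamma, l_r, l_p, num_states, r_max):
    """Evaluate Delta = 2 * eps * gamma * (l_r + l_p * |S| * R_max / (1 - gamma))."""
    return 2.0 * epsilon * gamma * (l_r + l_p * num_states * r_max / (1.0 - gamma))

# Hypothetical values, chosen only to show how the terms combine.
small_budget = delta(epsilon=0.1, gamma=0.9, l_r=1.0, l_p=0.5, num_states=10, r_max=1.0)
large_budget = delta(epsilon=0.2, gamma=0.9, l_r=1.0, l_p=0.5, num_states=10, r_max=1.0)
```

With these numbers `small_budget` comes out at about 9.18. $\Delta$ is linear in the perturbation budget $\epsilon$, so doubling $\epsilon$ doubles it, and it vanishes when $\epsilon = 0$.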

Figures (12)

  • Figure 1: Examples of perturbed states: (a) and (b) show states in a continuous state Gridworld, and (c) and (d) show states in the Atari Pong game.
  • Figure 2: Belief-enriched robust RL against state perturbations. Note that the agent can only access the true state $s_t$ and reward $R_t$ at the training stage.
  • Figure 3: Pessimistic Q-Learning
  • Figure 4: True, perturbed, and worst-case states in the Q-learning algorithm and the belief update. Beginning with the true state $s_0$ and the perturbed state $\tilde{s}_0$, the agent forms an initial belief, i.e., the $\epsilon$-ball centered at $\tilde{s}_0$. After the agent takes action $a_0$, the belief is updated to the region marked by the purple ball. On observing the next perturbed state $\tilde{s}_1$, the agent updates its belief by taking the intersection of the purple ball and the green ball.
  • Figure 5: An example of valid vs. invalid states in Pong.
  • ...and 7 more figures
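
The belief update described in the Figure 4 caption can be sketched on a small discrete grid. This is an illustrative simplification, not the authors' implementation: the 5x5 grid, the max-norm $\epsilon$-ball, the deterministic `step` function, and all names below are assumptions, and the second ball is taken to be the $\epsilon$-ball around the next perturbed observation.

```python
GRID = 5
STATES = {(x, y) for x in range(GRID) for y in range(GRID)}

def epsilon_ball(center, epsilon, states=STATES):
    """States within max-norm distance epsilon of the given observation."""
    return {s for s in states
            if max(abs(s[0] - center[0]), abs(s[1] - center[1])) <= epsilon}

def step(state, action):
    """Hypothetical deterministic transition: move by `action`, clipped to the grid."""
    x = min(max(state[0] + action[0], 0), GRID - 1)
    y = min(max(state[1] + action[1], 0), GRID - 1)
    return (x, y)

def update_belief(belief, action, next_observation, epsilon):
    """Push the belief through the action, then intersect it with the
    epsilon-ball around the next perturbed observation."""
    propagated = {step(s, action) for s in belief}
    return propagated & epsilon_ball(next_observation, epsilon)

# True state (3, 2) is observed as (2, 2); the attacker stays within epsilon = 1.
true_s0, perturbed_s0, eps = (3, 2), (2, 2), 1
belief0 = epsilon_ball(perturbed_s0, eps)   # initial belief

move_right = (1, 0)
true_s1 = step(true_s0, move_right)         # (4, 2)
perturbed_s1 = (4, 3)                       # again within epsilon of the true state
belief1 = update_belief(belief0, move_right, perturbed_s1, eps)
```

In this toy run the belief shrinks from 9 candidate states to 4 while still containing the true state, which is the effect the intersection step is meant to achieve.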

Theorems & Definitions (12)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 2 more