Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning

Zhi Luo, Xiyuan Yang, Pan Zhou, Di Wang

TL;DR

A black-box attack algorithm named LCBT is proposed, which uses Monte Carlo tree search for efficient action searching and manipulation. The paper shows that for an agent whose dynamic regret is sublinear in the total number of steps, LCBT can teach the agent to converge to target policies with only sublinear attack cost.

Abstract

Manipulating the interaction trajectories between an intelligent agent and its environment can control the agent's training and behavior, exposing potential vulnerabilities of reinforcement learning (RL). For example, in Cyber-Physical Systems (CPS) controlled by RL, an attacker can replace the actions chosen by the RL agent with other actions during the training phase, with harmful consequences. Existing work has studied action-manipulation attacks in tabular settings, where states and actions are discrete. Many emerging RL applications, such as autonomous driving, use continuous action spaces, yet action-manipulation attacks in this setting have not been thoroughly investigated. In this paper, we consider this problem in both white-box and black-box scenarios. Specifically, using only knowledge derived from trajectories, we propose a black-box attack algorithm named LCBT, which uses Monte Carlo tree search for efficient action searching and manipulation. Additionally, we show that for an agent whose dynamic regret is sublinear in the total number of steps, LCBT can teach the agent to converge to target policies with only sublinear attack cost, i.e., $O\left(\mathcal{R}(T) + MH^3K^E\log (MT)\right)$ with $0<E<1$, where $H$ is the number of steps per episode, $K$ is the total number of episodes, $T=KH$ is the total number of steps, $M$ is the number of subspaces into which the state space is divided, and $\mathcal{R}(T)$ is the bound on the RL algorithm's regret. We apply the proposed attack methods to three aggressive algorithms, DDPG, PPO, and TD3, in continuous settings, and they show promising attack performance.
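
As a quick check on why this bound counts as sublinear, the following short derivation divides it by $T = KH$. It treats $M$ and $H$ as fixed constants while $K$ grows, which is an assumption made here for illustration and not a condition stated in the abstract.

$$
\frac{\mathcal{R}(T) + MH^3K^E\log(MT)}{T}
= \frac{\mathcal{R}(T)}{T} + MH^2\,K^{E-1}\log(MKH)
\;\longrightarrow\; 0 \quad (K \to \infty)
$$

The first term vanishes because the regret is assumed sublinear in $T$, and the second because $K^{E-1}\log K \to 0$ whenever $0<E<1$. Under these assumptions the average attack cost per step goes to zero.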

Paper Structure

This paper contains 22 sections, 7 theorems, 45 equations, 8 figures, 5 tables, 3 algorithms.

Key Result

Lemma 1

If $\Delta_{min} > 0$ and the attacker follows the oracle attack scheme, then from the agent's perspective, $\pi^o$ is the optimal policy.
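
The intuition behind this kind of statement can be illustrated with a toy sketch. This summary does not spell out the oracle attack scheme, $\Delta_{min}$, or $\pi^o$, so the code below is not the paper's method: the reward function, the target action, the tolerance radius, and the rule "replace any action outside the tolerance with the worst action" are all hypothetical choices made for illustration.

```python
# Toy illustration only: every name and number below is hypothetical.
TARGET = 0.2        # action the attacker wants the agent to prefer
RADIUS = 0.1        # tolerance around the target that is left untouched
WORST_ACTION = 0.0  # lowest-reward action in this toy environment


def true_reward(action: float) -> float:
    """Unattacked reward on [0, 1]; the genuine optimum is at 0.8."""
    return 1.0 - abs(action - 0.8)


def manipulate(action: float) -> tuple[float, bool]:
    """Return the action actually executed and whether it was replaced."""
    if abs(action - TARGET) <= RADIUS:
        return action, False
    return WORST_ACTION, True


def observed_reward(action: float) -> float:
    """Reward the agent attributes to its own action while under attack."""
    executed, _ = manipulate(action)
    return true_reward(executed)


# Sweep a grid of candidate actions with and without the attacker.
grid = [i / 100 for i in range(101)]
best_clean = max(grid, key=true_reward)
best_attacked = max(grid, key=observed_reward)
attack_cost = sum(manipulate(a)[1] for a in grid)  # number of replaced actions

print(best_clean, best_attacked, attack_cost)
```

In this toy setting the agent's apparent best action moves from the true optimum into the tolerance region around the target, because every action outside that region is made to look as bad as the worst one. The count of replaced actions stands in for the attack cost; an agent that learns to stay near the target would trigger fewer replacements over time.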

Figures (8)

  • Figure 1: Action-manipulation attack model.
  • Figure 2: Reward and attack cost results of Environment $1$. In this experiment, $H=10$, $T=3 \times 10^5$, $r_a = 0.0625$, and the corresponding $\mathcal{K}$ for both the PPO and DDPG algorithms is 0. In the LCBT attack, the number of state subspaces is $M=16$, and $\rho=1/2$.
  • Figure 3: Reward and attack cost results of Environment $2$. In this experiment, $H=10$, $T=10^6$, $r_a = 0.31$, and the corresponding $\mathcal{K}$ for both the DDPG and TD3 algorithms is 0, with $M=81$ and $\rho = 1/\sqrt{2}$.
  • Figure 4: Environment $1$. The objective of this environment is to control the slider to slide on the rod.
  • Figure 5: Environment $2$. The objective of this environment is to control a vehicle to move on a two-dimensional plane.
  • ...and 3 more figures

Theorems & Definitions (11)

  • Lemma 1
  • Theorem 1
  • Remark 1
  • Definition 1
  • Theorem 2
  • Remark 2
  • Remark 3
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • ...and 1 more