Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning

Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel

TL;DR

The paper tackles black-box adversarial evasion by framing adversarial sample generation as a Markov Decision Process (MDP) and learning attack policies with Proximal Policy Optimization (PPO). It introduces two RL-based attack variants, RL Max Loss and RL Min Norm, and shows that the agents improve over the course of training: attack success rate (ASR) rises by up to 13.2% and the average number of victim model queries per attack falls by up to 16.9%. After training, the agents achieve 17% more success on unseen inputs than state-of-the-art image attacks. The approach is evaluated on CIFAR-10 and SVHN and across multiple victim models, and it outperforms leading black-box baselines on test data, indicating a potent, scalable attack vector that exploits past attack experience. These findings carry security implications and motivate defenses that account for persistent, learning adversaries, as well as future work on richer action spaces and cross-domain applicability.

Abstract

Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods that craft adversarial samples independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On two image classification benchmarks, our agent increases attack success rate by up to 13.2% and decreases the average number of victim model queries per attack by up to 16.9% from the start to the end of training. In a head-to-head comparison with state-of-the-art image attacks, our approach enables an adversary to generate adversarial samples with 17% more success on unseen inputs post-training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to train agents that attack ML models efficiently and at scale.
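To make the MDP framing concrete, here is a minimal sketch of an attack episode as an environment. It is an illustration under stated assumptions, not the authors' implementation: only the state layout (input, label, victim output) follows the paper's description, while `ToyVictim`, `EvasionEnv`, the additive-perturbation action, the $\ell_2$ budget `eps`, and the loss-increase reward are hypothetical choices made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class ToyVictim:
    """Hypothetical black-box stand-in: a fixed random linear classifier."""
    def __init__(self, dim, classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(classes, dim))
        self.queries = 0  # every call counts as one victim query

    def __call__(self, x):
        self.queries += 1
        return softmax(self.W @ x)

class EvasionEnv:
    """One episode = one attack on one input. State = (x, y, Z(x))."""
    def __init__(self, victim, eps=0.5, max_steps=20):
        self.victim, self.eps, self.max_steps = victim, eps, max_steps

    def reset(self, x0, y):
        self.x0 = np.asarray(x0, dtype=float).copy()
        self.x, self.y, self.t = self.x0.copy(), int(y), 0
        self.z = self.victim(self.x)  # initial query of the clean input
        return (self.x.copy(), self.y, self.z.copy())

    def step(self, delta):
        # Assumed action: an additive perturbation, projected back onto
        # the l2 ball of radius eps around the clean input x0.
        pert = self.x + delta - self.x0
        norm = np.linalg.norm(pert)
        if norm > self.eps:
            pert = pert * (self.eps / norm)
        self.x = self.x0 + pert
        z_new = self.victim(self.x)
        # Assumed reward: increase in cross-entropy loss on the true label.
        reward = float(np.log(self.z[self.y] + 1e-12)
                       - np.log(z_new[self.y] + 1e-12))
        self.z, self.t = z_new, self.t + 1
        success = int(np.argmax(z_new)) != self.y
        done = success or self.t >= self.max_steps
        return (self.x.copy(), self.y, self.z.copy()), reward, done, success
```

A policy trained with an RL algorithm such as PPO would choose `delta` from the returned state at each step. Because the same policy is reused across episodes, experience from earlier attacks can shape later ones, which is the contrast with stateless per-sample optimization drawn in the abstract.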

Paper Structure

This paper contains 12 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the reinforcement learning attack with CIFAR-10. The adversary interacts with the victim model by querying images and receiving feedback, iteratively generating adversarial samples.
  • Figure 2: Training the adversarial agent: (1) a clean input and label $(x_0,y)$ are sampled at random from the training dataset $\mathcal{D}$, (2) the start state $s_0=(x_0,y,Z(x_0))$ is initialized with the initial query of $x_0$, (3, 4) the transition and reward functions for RL Max Loss and RL Min Norm mediate interaction with the victim model, (5) the policy $\pi_{\theta}$ is updated according to the RL algorithm.
  • Figure 3: RL Max Loss and RL Min Norm attack training: attack success rate (ASR), average queries on successful attacks (AQ), and average $\ell_2$-norm distortion on successful attacks ($\ell_2$) with respect to policy updates for 3 trials per attack with a 95% confidence interval on CIFAR-10 and SVHN datasets.
  • Figure 4: RL Max Loss hyperparameter sensitivity: attack success rate (ASR) versus $\epsilon$ for trained agents, averaged over 3 random seeds.
  • Figure 5: RL Min Norm hyperparameter sensitivity: attack success rate (ASR) and $\ell_2$ distortion versus $c$ for trained agents, averaged over 3 random seeds.
  • ...and 2 more figures
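
The three quantities tracked in Figure 3 (ASR, average queries on successful attacks, and average $\ell_2$ distortion on successful attacks) can be computed from per-attack records roughly as follows. This is a minimal sketch: the function name `attack_metrics` and the input layout are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def attack_metrics(success, queries, l2):
    """Summarize a batch of attacks.

    success: per-attack booleans (True if the victim was fooled)
    queries: per-attack number of victim model queries
    l2:      per-attack l2 norm of the final perturbation
    """
    success = np.asarray(success, dtype=bool)
    queries = np.asarray(queries, dtype=float)
    l2 = np.asarray(l2, dtype=float)
    asr = float(success.mean())
    if not success.any():
        # AQ and l2 are defined over successful attacks only.
        return {"ASR": asr, "AQ": float("nan"), "l2": float("nan")}
    return {
        "ASR": asr,
        "AQ": float(queries[success].mean()),
        "l2": float(l2[success].mean()),
    }
```

Note that failed attacks affect only ASR here; averaging queries and distortion over successes alone mirrors the definitions given in the Figure 3 caption.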