Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning
Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel
TL;DR
The paper addresses black-box adversarial evasion by framing adversarial sample generation as a Markov Decision Process (MDP) and learning attack policies with Proximal Policy Optimization (PPO). It introduces two RL-based attack variants, RL Max Loss and RL Min Norm, and shows that the agents become more effective and more efficient over the course of training: attack success rate (ASR) rises by up to 13.2% and the average number of victim model queries per attack falls by up to 16.9%. After training, the agents generate adversarial samples with 17% more success on unseen inputs than leading black-box baselines. The approach is evaluated on CIFAR-10 and SVHN and across multiple victim models. Because the agent reuses past attack experience instead of crafting each sample independently, the attack can be applied efficiently and at scale. The findings motivate defenses that account for persistent, learning adversaries, as well as future work on richer action spaces and cross-domain applicability.
Abstract
Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods that craft adversarial samples independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On two image classification benchmarks, our agent increases attack success rate by up to 13.2% and decreases the average number of victim model queries per attack by up to 16.9% from the start to the end of training. In a head-to-head comparison with state-of-the-art image attacks, our approach enables an adversary to generate adversarial samples with 17% more success on unseen inputs post-training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to train agents that attack ML models efficiently and at scale.
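To make the MDP framing concrete, the following is a minimal sketch of how an attack episode and the two reported metrics (attack success rate and average victim queries per attack) could be set up. It is not the authors' implementation. The toy logistic victim, the per-feature action space, the loss-difference reward, the perturbation budget `epsilon`, and the names `ToyVictim`, `EvasionEnv`, and `evaluate` are all assumptions made for illustration. A random policy stands in where a PPO-trained policy would go.

```python
import math
import random


class ToyVictim:
    """Hypothetical black-box binary classifier; only predict_proba is exposed."""

    def __init__(self, weights, bias=0.0):
        self._w, self._b = list(weights), bias
        self.queries = 0  # every call to predict_proba counts as one query

    def predict_proba(self, x):
        self.queries += 1
        z = sum(w * xi for w, xi in zip(self._w, x)) + self._b
        p1 = 1.0 / (1.0 + math.exp(-z))
        return [1.0 - p1, p1]


class EvasionEnv:
    """One episode = one attack on one input (illustrative design, not the paper's)."""

    def __init__(self, victim, epsilon=0.5, step=0.1, max_steps=20):
        self.victim, self.epsilon = victim, epsilon
        self.step_size, self.max_steps = step, max_steps

    def _loss(self):
        # Cross-entropy of the victim's output with respect to the true label.
        return -math.log(max(self.probs[self.label], 1e-12))

    def reset(self, x, label):
        self.x0, self.x, self.label, self.steps = list(x), list(x), label, 0
        self.probs = self.victim.predict_proba(self.x)
        return tuple(self.x), tuple(self.probs)

    def step(self, action):
        i, direction = action  # perturb feature i by +/- step_size
        lo, hi = self.x0[i] - self.epsilon, self.x0[i] + self.epsilon
        old_loss = self._loss()
        self.x[i] = min(hi, max(lo, self.x[i] + direction * self.step_size))
        self.probs = self.victim.predict_proba(self.x)
        self.steps += 1
        reward = self._loss() - old_loss  # positive when the victim's loss increases
        success = self.probs.index(max(self.probs)) != self.label
        done = success or self.steps >= self.max_steps
        return (tuple(self.x), tuple(self.probs)), reward, done, success


def evaluate(env, policy, dataset):
    """Return (attack success rate, mean victim queries per attack)."""
    successes, queries = 0, []
    for x, label in dataset:
        start = env.victim.queries
        state, done, success = env.reset(x, label), False, False
        while not done:
            state, _, done, success = env.step(policy(state))
        successes += int(success)
        queries.append(env.victim.queries - start)
    return successes / len(dataset), sum(queries) / len(queries)


if __name__ == "__main__":
    rng = random.Random(0)
    env = EvasionEnv(ToyVictim([2.0, -1.0]))
    data = [([0.3, 0.1], 1), ([-0.2, 0.4], 0)]
    random_policy = lambda state: (rng.randrange(2), rng.choice((-1, 1)))
    asr, mean_queries = evaluate(env, random_policy, data)
    print(f"ASR={asr:.2f}, mean queries={mean_queries:.1f}")
```

In this sketch the state is the pair of the current perturbed input and the victim's output probabilities, so a learned policy could condition on feedback from earlier queries. A stateless optimizer would restart from scratch on every input, whereas a trained policy carries what it learned across episodes, which is the property the abstract highlights.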
