Evading Black-box Classifiers Without Breaking Eggs

Edoardo Debenedetti; Nicholas Carlini; Florian Tramèr

TL;DR

Existing decision-based evasion attacks issue a large number of queries that a security-critical system would flag. The paper poses it as an open problem to build black-box attacks that are more effective under realistic cost metrics.

Abstract

Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware or harmful content). Queries to such systems carry a fundamentally asymmetric cost: queries detected as "bad" come at a higher cost because they trigger additional security filters, e.g., usage throttling or account suspension. Yet, we find that existing decision-based attacks issue a large number of "bad" queries, which likely renders them ineffective against security-critical systems. We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries. We thus pose it as an open problem to build black-box attacks that are more effective under realistic cost metrics.
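The asymmetric cost described in the abstract can be illustrated with a small sketch. It assumes a linear cost model in which every query costs `c_0` and each "bad" (flagged) query adds a further `c_flagged`; the two symbol names appear in the paper's section list, but the linear form, the function name `attack_cost`, and all query counts below are illustrative assumptions, not figures from the paper.

```python
def attack_cost(total_queries, flagged_queries, c_0=1.0, c_flagged=100.0):
    """Cost of an attack under an assumed linear, asymmetric cost model.

    Every query costs c_0; each flagged ("bad") query costs an extra c_flagged.
    The linear form and the default values are assumptions for illustration.
    """
    if not 0 <= flagged_queries <= total_queries:
        raise ValueError("flagged_queries must lie between 0 and total_queries")
    return c_0 * total_queries + c_flagged * flagged_queries


# Hypothetical query counts: a baseline attack and a stealthier variant that
# issues more queries overall but far fewer flagged ones.
baseline = attack_cost(total_queries=1000, flagged_queries=500)
stealthy = attack_cost(total_queries=3000, flagged_queries=100)
print(baseline, stealthy)
```

Under these made-up numbers the stealthier attack is cheaper even though it issues three times as many queries. When `c_flagged` is close to zero the ranking reverses, which is the trade-off the abstract points to.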

Paper Structure (50 sections, 2 theorems, 1 equation, 11 figures, 4 tables)

Introduction
Decision-based Attacks
Applications
Asymmetric Query Costs
Example: evading NSFW content detection
Existing attacks are not designed to be stealthy
Selecting the values of $c_0$ and $c_{\mathrm{flagged}}$
Designing Stealthy Decision-based Attacks
How do Decision-based Attacks Work?
An overview of existing attacks
Maximizing Information per Flagged Query
Measuring distances with one flagged query
Trading non-flagged and flagged queries
A further optimization: early stopping
Stealthy Variants of Decision-based Attacks
...and 35 more sections

Key Result

Theorem 1

Assume $g$ has gradients that are $L$-Lipschitz and bounded by $C$ (assume $L$ and $C$ are constants for simplicity). Let $d$ be the data dimensionality. Optimizing $g$ with $T$ iterations of gradient descent, using Opt's gradient estimator, yields a convergence rate of $\mathbb{E}[\|\nabla g(x)\|_2

Figures (11)

Figure 1: Existing decision-based attacks (left) make many "flagged" queries, i.e., queries that get classified into the class that the attacker aims to evade (e.g., NSFW content). In security-critical applications, such flagged queries would trigger additional security mechanisms, thus raising the cost of the attack. Our stealthy attacks (right) trade off flagged queries for non-flagged ones, to find adversarial examples without raising security alerts.
Figure 2: Line-search strategies to find the boundary (in gray) between a benign input (green) and the original flagged input (brown). Red crosses are flagged queries.
Figure 3: Our stealthy attacks find small adversarial perturbations with fewer flagged queries. For each benchmark, we report the median adversarial distance as a function of flagged queries for various $\ell_2$ attacks (top) and $\ell_\infty$ attacks (bottom). Our attack variants (solid lines), which are designed to be stealthy, require fewer flagged queries than the original attacks (dashed lines) to reach the same adversarial distance.
Figure 4: Attack on a commercial black-box NSFW detector. We run RayS and Stealthy RayS on 200 samples from our ImageNet-NSFW dataset. We count a query as flagged if the detector rates it as "likely" to be NSFW. The stealthy attack needs $2.2\times$ fewer flagged queries to find adversarial perturbations of size $\epsilon=32/255$.
Figure 5: Trade-offs between non-flagged and flagged queries for various search strategies in the RayS attack on ImageNet.
...and 6 more figures

Theorems & Definitions (4)

Theorem 1: Adapted from liu2018zeroth (Theorem 2)
Theorem 2: Adapted from cheng2019sign (Theorem 3.1)
proof : Proof of Theorem 1
proof : Proof of Theorem 2
