PubDef: Defending Against Transfer Attacks From Public Models

Chawin Sitawarin; Jaewon Chang; David Huang; Wesson Altoyan; David Wagner

TL;DR

Under a threat model in which the adversary mounts transfer attacks from publicly available surrogate models, PubDef outperforms state-of-the-art white-box adversarial training by a large margin with almost no loss in clean accuracy.

Abstract

Adversarial attacks have been a looming and unaddressed threat in the industry. However, through a decade-long history of the robustness evaluation literature, we have learned that mounting a strong or optimal attack is challenging. It requires both machine learning and domain expertise. In other words, the white-box threat model, religiously assumed by a large majority of the past literature, is unrealistic. In this paper, we propose a new practical threat model where the adversary relies on transfer attacks through publicly available surrogate models. We argue that this setting will become the most prevalent for security-sensitive applications in the future. We evaluate the transfer attacks in this setting and propose a specialized defense method based on a game-theoretic perspective. The defenses are evaluated under 24 public models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and ImageNet). Under this threat model, our defense, PubDef, outperforms the state-of-the-art white-box adversarial training by a large margin with almost no loss in the normal accuracy. For instance, on ImageNet, our defense achieves 62% accuracy under the strongest transfer attack versus only 36% for the best adversarially trained model. Its accuracy when not under attack is only 2% lower than that of an undefended model (78% vs 80%). We release our code at https://github.com/wagner-group/pubdef.

Paper Structure (33 sections, 1 theorem, 7 equations, 16 figures, 12 tables)

Introduction
Related Work
Threat Model
Game-Theoretic Perspective
Simple Game
Complex Game
Our Practical Defense
Loss Function and Weighting Constants
Defender's Source Model Selection
Experiments
Setup
Results
Discussion
Ablation Studies
Robustness to White-Box and Query-Based Attacks
...and 18 more sections

Key Result

Theorem 1

Given a "simple game" described above and its payoff matrix ${\bm{R}}$, there exists a mixed strategy $\pi_A^*$ for the attacker and a mixed strategy $\pi_D^*$ for the defender such that where $\Delta^{sa-1}$ is the ($sa-1$)-dimensional probability simplex.
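For reference, the standard form of von Neumann's minimax theorem for a bilinear payoff over two probability simplices can be written as below. This is a sketch under an assumption the text above does not confirm: rows of ${\bm{R}}$ index the defender's pure strategies, columns index the attacker's, and each entry is the defender's payoff (for example, adversarial accuracy), so the defender maximizes and the attacker minimizes. If the paper uses the opposite orientation, the roles of $\max$ and $\min$ swap accordingly.

$$
\max_{\pi_D \in \Delta^{sa-1}} \; \min_{\pi_A \in \Delta^{sa-1}} \; \pi_D^\top {\bm{R}} \, \pi_A
\;=\;
\min_{\pi_A \in \Delta^{sa-1}} \; \max_{\pi_D \in \Delta^{sa-1}} \; \pi_D^\top {\bm{R}} \, \pi_A
\;=\;
{\pi_D^*}^\top {\bm{R}} \, \pi_A^*
$$

In this form, neither player can improve their outcome by unilaterally deviating from $\pi_D^*$ or $\pi_A^*$, and the common value is the value of the game.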

Figures (16)

Figure 1: (a) Proposed threat model: transfer attack with public source models (TAPM). We consider a low-cost black-box adversary who generates adversarial examples from publicly available models with a known attack algorithm. (b) Our approach is based on stopping each major category of attack with a combination of multiple mechanisms. (c) Our defense, PubDef, trains the defended model to resist transfer attacks from several publicly available source models. Our model is robust to a wide range of transfer attacks, both from source models it was trained against and from source models it was not, while also maintaining high clean accuracy.
Figure 2: The payoff matrix of the simple game.
Figure 3: Adversarial accuracy of PubDef against 264 transfer attacks (24 source models $\times$ 11 attack algorithms) on ImageNet. ✪ denotes the source models this defense is trained against. We cannot produce the NA attack on timm's VGG model (shown as "n/a") because of its in-place operation.
Figure 4: Adversarial accuracy of PubDef under seen/unseen transfer attacks. Seen attacks (seen src. and seen algo.) are the 3 to 4 attacks that were used to train our defense; unseen attacks are all others from the set of 264 possible attacks. They are categorized by whether the source models (src.) and the attack algorithms (algo.) are seen. All non-PGD attacks are unseen attack algorithms.
Figure 5: Clean and adversarial accuracy on four PubDef models trained with 4 ($4 \times 1$), 8 ($4 \times 2$), 12 ($4 \times 3$), and 24 (All) source models. "4 $\times m$" means $m$ source models are chosen from each of the four groups.
...and 11 more figures

Theorems & Definitions (1)

Theorem 1: von Neumann's minimax theorem with a bilinear function (von Neumann, 1928)
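
To make the minimax statement concrete, the sketch below approximates the equilibrium of a small zero-sum matrix game with fictitious play. It is an illustration only, not the paper's method: the function name `fictitious_play`, the 2 x 2 payoff matrix, and the convention that rows are defender strategies, columns are attacker strategies, and entries are the defender's payoff are all assumptions made for this example, and the numbers are not taken from the paper.

```python
def fictitious_play(R, iters=20000):
    """Approximate a mixed-strategy equilibrium of a zero-sum matrix game.

    Assumed convention (hypothetical, for illustration): R[i][j] is the
    defender's payoff when the defender plays row i and the attacker plays
    column j. The defender maximizes; the attacker minimizes.
    """
    n, m = len(R), len(R[0])
    d_counts = [0] * n      # how often the defender played each row
    a_counts = [0] * m      # how often the attacker played each column
    d_payoff = [0.0] * n    # cumulative payoff of each row vs. attacker's past plays
    a_payoff = [0.0] * m    # cumulative payoff of each column vs. defender's past plays
    i, j = 0, 0             # arbitrary initial pure strategies
    for _ in range(iters):
        d_counts[i] += 1
        a_counts[j] += 1
        for r in range(n):
            d_payoff[r] += R[r][j]
        for c in range(m):
            a_payoff[c] += R[i][c]
        # Each player best-responds to the opponent's empirical play so far.
        i = max(range(n), key=d_payoff.__getitem__)
        j = min(range(m), key=a_payoff.__getitem__)
    pi_D = [c / iters for c in d_counts]
    pi_A = [c / iters for c in a_counts]
    # Payoff the defender's empirical mix guarantees against any attacker column.
    lower = min(a_payoff) / iters
    # Payoff the attacker's empirical mix concedes against the best defender row.
    upper = max(d_payoff) / iters
    return pi_D, pi_A, lower, upper


if __name__ == "__main__":
    # Made-up payoffs: two defender choices vs. two attacker choices.
    R = [[0.9, 0.3],
         [0.4, 0.8]]
    pi_D, pi_A, lower, upper = fictitious_play(R)
    print("defender mix:", pi_D)
    print("attacker mix:", pi_A)
    print("game value lies in:", (lower, upper))
```

The returned `lower` and `upper` always bracket the value of the game, and the gap between them shrinks as `iters` grows. Fictitious play is chosen here only because it needs no external solver; a linear program would give the exact equilibrium.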
