SoK: Analyzing Adversarial Examples: A Framework to Study Adversary Knowledge

Lucas Fenaux, Florian Kerschbaum

TL;DR

This work addresses the lack of formalization for adversary knowledge in adversarial example research for image classification. It introduces a rigorous framework based on information extraction oracles, distinguishers, and a standardized Adversarial Example Game to compare threat models and attacks, plus information Hasse diagrams to organize knowledge categories. Through a comprehensive survey of 2022+ attacks and an extensive evaluation on ImageNet and CIFAR-10, the authors show that attacker access to model, data, and training information dramatically shapes attack effectiveness, with transferable attacks nearly matching white-box performance under certain settings. The framework provides a path toward reproducible, comparable evaluations and reveals that defenses remain fragile against well-informed transferable attacks, underscoring the need for standardized threat models and evaluation protocols in adversarial ML.

Abstract

Adversarial examples are malicious inputs to machine learning models that trigger a misclassification. This type of attack has been studied for close to a decade, and we find that there is a lack of study and formalization of adversary knowledge when mounting attacks. This has yielded a complex space of attack research with hard-to-compare threat models and attacks. We focus on the image classification domain and provide a theoretical framework to study adversary knowledge inspired by work in order theory. We present an adversarial example game, inspired by cryptographic games, to standardize attacks. We survey recent attacks in the image classification domain and classify their adversary's knowledge in our framework. From this systematization, we compile results that both confirm existing beliefs about adversary knowledge, such as the potency of information about the attacked model, and allow us to derive new conclusions on the difficulty associated with the white-box and transferable threat models, for example, that transferable attacks might not be as difficult as previously thought.

Paper Structure (36 sections, 7 theorems, 7 equations, 7 figures, 15 tables)

Key Result

Theorem 1

Figure 1 (the Model Oracle Hasse Diagram) holds under the $\sqsubset$ ordering.
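
To make concrete what it means for a Hasse diagram to hold under an ordering, the sketch below derives the edges of a Hasse diagram (the covering pairs) from a strict partial order. It is an illustration only: the oracle names, the information each one reveals, and the choice of strict set inclusion as a stand-in for $\sqsubset$ are all assumptions made here, not the paper's definitions or its actual diagram.

```python
# Illustrative sketch: derive Hasse diagram edges from a strict partial order.
# ASSUMPTION: oracles are modeled as sets of revealed information and the
# ordering is strict set inclusion. These names are hypothetical and are not
# taken from the paper.

ORACLES = {
    "none": frozenset(),
    "labels": frozenset({"labels"}),
    "scores": frozenset({"labels", "scores"}),
    "architecture": frozenset({"architecture"}),
    "full_model": frozenset({"labels", "scores", "architecture", "weights"}),
}


def strictly_below(a, b):
    """True if oracle a reveals strictly less information than oracle b."""
    return ORACLES[a] < ORACLES[b]


def hasse_edges(elements, below):
    """Return covering pairs (a, b): a is below b with nothing in between."""
    elements = list(elements)
    edges = set()
    for a in elements:
        for b in elements:
            if a == b or not below(a, b):
                continue
            # (a, b) is an edge only if no third element c sits between them.
            has_middle = any(
                c not in (a, b) and below(a, c) and below(c, b)
                for c in elements
            )
            if not has_middle:
                edges.add((a, b))
    return edges


edges = hasse_edges(ORACLES, strictly_below)
for lower, upper in sorted(edges):
    print(f"{lower} -> {upper}")
```

A diagram is consistent with the ordering when its edges are exactly these covering pairs: relations implied by transitivity, such as the hypothetical `none` below `scores`, are left out because `labels` lies between them.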

Figures (7)

  • Figure 1: Model Oracle Hasse Diagram
  • Figure 2: Data Oracle Hasse Diagram
  • Figure 3: Train Oracle Hasse Diagram
  • Figure 4: Defense Oracle Hasse Diagram
  • Figure 5: Adversarial Example Game Diagram Algorithm
  • ...and 2 more figures

Theorems & Definitions (29)

  • Definition 1: Adversarial Example
  • Definition 2: Grounded Adversarial Example
  • Definition 3: Targeted/Untargeted Adversarial Example
  • Definition 4: Indistinguishability
  • Definition 5: Stealth
  • Definition 6: Information Extraction Oracle
  • Definition 7
  • Definition 8: Information Extraction Oracle domination operator
  • Definition 9: Information Extraction Oracle Combination
  • Theorem 1
  • ...and 19 more
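
The listing above names the targeted and untargeted variants of an adversarial example (Definitions 1 and 3) without reproducing them. As a rough illustration, the sketch below checks both conditions using the common $L_\infty$-bounded formulation from the wider literature. The toy linear classifier, the inputs, and the budget `epsilon` are hypothetical, and the paper's formal definitions may differ from this formulation.

```python
# Illustrative sketch of targeted vs. untargeted adversarial examples.
# ASSUMPTION: uses the common L-infinity-bounded formulation; the classifier,
# inputs, and epsilon are hypothetical and not taken from the paper.

# Toy linear classifier: one weight row per class, prediction is the argmax.
WEIGHTS = [
    [1.0, 0.0],    # class 0
    [0.0, 1.0],    # class 1
    [-1.0, -1.0],  # class 2
]


def classify(x):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in WEIGHTS]
    return scores.index(max(scores))


def linf_distance(x, x_adv):
    """Largest absolute per-coordinate difference between two inputs."""
    return max(abs(a - b) for a, b in zip(x, x_adv))


def is_untargeted(x, x_adv, true_label, epsilon):
    """Within budget and classified as anything other than the true label."""
    return linf_distance(x, x_adv) <= epsilon and classify(x_adv) != true_label


def is_targeted(x, x_adv, target_label, epsilon):
    """Within budget and classified as the attacker's chosen target label."""
    return linf_distance(x, x_adv) <= epsilon and classify(x_adv) == target_label


x = [0.5, 0.25]      # clean input, classified as class 0
x_adv = [0.25, 0.5]  # perturbed input, each coordinate moved by 0.25

print(is_untargeted(x, x_adv, true_label=0, epsilon=0.25))
print(is_targeted(x, x_adv, target_label=1, epsilon=0.25))
print(is_targeted(x, x_adv, target_label=2, epsilon=0.25))
```

Under this formulation, a targeted success toward any label other than the true one is also an untargeted success, while the converse does not hold.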