Theoretical Understanding of Learning from Adversarial Perturbations

Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

TL;DR

The paper addresses why adversarial perturbations can deceive classifiers and transfer across models by introducing a theoretical framework based on a one-hidden-layer network trained on mutually orthogonal samples. It shows that perturbations act as class features, characterizes the decision boundary learned from mislabeled adversarial examples built with geometry-inspired perturbations, and proves that, under mild conditions, this boundary aligns with the one learned from clean data. Experiments on artificial data and real datasets (MNIST, Fashion-MNIST, CIFAR-10) support the theory, demonstrating strong boundary alignment and generalization even when training data are perturbed or randomly labeled. These findings provide a theoretical justification for the feature hypothesis and offer insights into the behavior of adversarial examples and their transferability across models.

Abstract

It is not fully understood why adversarial examples can deceive neural networks and transfer between different networks. To elucidate this, several studies have hypothesized that adversarial perturbations, while appearing as noises, contain class features. This is supported by empirical evidence showing that networks trained on mislabeled adversarial examples can still generalize well to correctly labeled test samples. However, a theoretical understanding of how perturbations include class features and contribute to generalization is limited. In this study, we provide a theoretical framework for understanding learning from perturbations using a one-hidden-layer network trained on mutually orthogonal samples. Our results highlight that various adversarial perturbations, even perturbations of a few pixels, contain sufficient class features for generalization. Moreover, we reveal that the decision boundary when learning from perturbations matches that from standard samples except for specific regions under mild conditions. The code is available at https://github.com/s-kumano/learning-from-adversarial-perturbations.

Paper Structure

This paper contains 35 sections, 26 theorems, 118 equations, 19 figures, and 4 tables.

Key Result

Theorem 3.3

Let $\{(\bm{x}_n,y_n)\}^N_{n=1} \subset \mathbb{R}^d \times \{\pm1\}$ be a training dataset. Let $R_\mathrm{max} := \max_n \|\bm{x}_n\|$, $R_\mathrm{min} := \min_n \|\bm{x}_n\|$, and $p_\mathrm{max} := \max_{n \ne k} | \langle \bm{x}_n, \bm{x}_k \rangle |$. A one-hidden-layer neural network $f: \dots$
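
The three quantities defined in the theorem can be computed directly from a data matrix. The following is a minimal sketch in NumPy, not taken from the paper's repository: the helper name `orthogonality_stats`, the sample sizes, and the random seed are illustrative assumptions. It shows that samples drawn from $U([-1,1]^d)$ in high dimension are close to mutually orthogonal, in the sense that $p_\mathrm{max}$ is small relative to $R_\mathrm{min}^2$.

```python
import numpy as np


def orthogonality_stats(X):
    """Return (R_max, R_min, p_max) for a dataset whose rows are the samples x_n.

    R_max and R_min are the largest and smallest Euclidean norms, and
    p_max is the largest absolute inner product between two distinct samples.
    """
    norms = np.linalg.norm(X, axis=1)
    gram = X @ X.T
    # Mask out the diagonal so that only pairs with n != k are considered.
    off_diagonal = np.abs(gram[~np.eye(len(X), dtype=bool)])
    return norms.max(), norms.min(), off_diagonal.max()


# Exactly orthogonal samples: p_max is zero.
R_max_eye, R_min_eye, p_max_eye = orthogonality_stats(2.0 * np.eye(3))

# Illustrative high-dimensional uniform data (sizes are assumptions).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 10_000))
R_max, R_min, p_max = orthogonality_stats(X)

# Near-orthogonality: the largest cross inner product relative to the
# smallest squared norm.
ratio = p_max / R_min**2
print(f"R_max={R_max:.2f}, R_min={R_min:.2f}, p_max={p_max:.2f}, ratio={ratio:.4f}")
```

A small `ratio` indicates that the off-diagonal entries of the Gram matrix are negligible compared with the diagonal, which is the regime the mutually orthogonal assumption idealizes.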

Figures (19)

  • Figure 1: Decision boundaries of classifiers trained on multidimensional artificial datasets. The axis vectors $\bm{v}$ and $\bm{u}$ are defined in \ref{th:implicit-bias-full}. Left: Boundaries from standard data and from noises with and without adversarial perturbations ($d = 10,000$). The blue circles and orange crosses indicate standard data projections onto this plane. Right: Boundaries across varying input dimensions (with $N^\mathrm{adv} = 10,000$ fixed) and numbers of noise samples (with $d = 10,000$ fixed). First row: results from standard data; second and third rows: results from noises with and without adversarial perturbations, respectively. Percentages indicate the classification accuracy on the standard data.
  • Figure 2: Accuracy on standard data of classifiers trained on uniform noises with or without adversarial perturbations, in the artificial dataset. The blue solid and orange dashed lines represent the results from noises with and without perturbations (i.e., pure noises), respectively. We fix $N^\mathrm{adv} = 10,000$ on the left and $d = 10,000$ on the right.
  • Figure A3: Artificial data based on uniform noises. Standard images were drawn from the uniform distribution $U([-1,1]^d)$, and their corresponding labels from $U(\{\pm1\})$. We treated them as natural images. Noise images were similarly drawn from $U([-1,1]^d)$. Adversarial examples were generated by superimposing adversarial perturbations on the noise images to fool a classifier trained on the standard (but seemingly noisy) images. The labels below the adversarial examples indicate target labels that were randomly sampled from $\{\pm1\}$. The labels below the noise images were used for comparative experiments in which classifiers were trained on these noises.
  • Figure A4: Decision boundaries of classifiers trained on artificial datasets based on uniform noises and $L_0$ adversarial perturbations. Each variable was varied from the baseline setting $d=10,000$, $N^\mathrm{adv} = 10,000$, $N = 1000$, and $d_\delta / d = 0.05$. The description is the same as \ref{fig:decision-map-merge}.
  • Figure A5: Decision boundaries of classifiers trained on artificial datasets based on Gaussian noises and $L_0$ adversarial perturbations. The description is the same as \ref{fig:decision-maps-L0-uniform}.
  • ...and 14 more figures
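
The artificial-data setup described in the Figure A3 caption can be illustrated with a small NumPy sketch. Everything below is a simplified stand-in rather than the authors' implementation: the classifier is linear with weights equal to the label-weighted sum of its training samples (the paper studies a one-hidden-layer network), the perturbation is a single $L_2$ step of norm `eps` along the target-signed weight direction, and the sizes `d`, `n_clean`, `n_noise`, and `eps` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clean, n_noise, eps = 2000, 50, 1000, 3.0  # illustrative assumptions


def fit_linear(samples, labels):
    """Stand-in training rule (assumption): label-weighted sum of samples."""
    return labels @ samples


def accuracy(w, samples, labels):
    """Fraction of samples whose sign(<w, x>) matches the label."""
    return float(np.mean(np.sign(samples @ w) == labels))


# "Standard" data: uniform samples with uniformly random labels.
X = rng.uniform(-1.0, 1.0, size=(n_clean, d))
y = rng.choice([-1.0, 1.0], size=n_clean)
w_f = fit_linear(X, y)

# Noise images and randomly sampled target labels.
Z = rng.uniform(-1.0, 1.0, size=(n_noise, d))
t = rng.choice([-1.0, 1.0], size=n_noise)

# Targeted perturbation against the linear classifier: move each noise
# image a distance eps toward its target class along w_f.
delta = eps * t[:, None] * (w_f / np.linalg.norm(w_f))
Z_adv = Z + delta

# Fraction of adversarial examples that the clean-data classifier
# assigns to their target labels.
fool_rate = accuracy(w_f, Z_adv, t)

# Train one classifier on adversarial examples with target labels and
# one on the unperturbed noises with the same labels.
w_adv = fit_linear(Z_adv, t)
w_noise = fit_linear(Z, t)

# Evaluate all classifiers on the standard data and its original labels.
acc_f = accuracy(w_f, X, y)
acc_adv = accuracy(w_adv, X, y)
acc_noise = accuracy(w_noise, X, y)
print(f"fool rate={fool_rate:.2f}, clean={acc_f:.2f}, "
      f"perturbed noises={acc_adv:.2f}, pure noises={acc_noise:.2f}")
```

In this sketch the target labels are independent of the standard data, so the only information about the standard labels that reaches the second classifier is carried by the perturbations. The classifier trained on pure noises serves as the comparison case mentioned in the caption.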

Theorems & Definitions (47)

  • Definition 1.1: Learning from adversarial perturbations (later redefined)
  • Definition 3.2: Learning from adversarial perturbations
  • Theorem 3.3: Rearranged from frei2023implicit
  • Corollary 3.4: Decision Boundary when learning from perturbations
  • Theorem 4.1: Decision boundary when learning from geometry-inspired perturbations on natural samples
  • Theorem 4.2: Consistent decision of learning from geometry-inspired perturbations on natural samples
  • Theorem 4.3: Decision boundary when learning from geometry-inspired perturbations on uniform noises
  • Theorem 4.4: Consistent decision of learning from geometry-inspired perturbations on uniform noises
  • Corollary 4.4: Complete classification for natural training samples when learning from geometry-inspired perturbations on uniform noises
  • Theorem B.1: Rearranged from frei2023implicit
  • ...and 37 more