Wide Two-Layer Networks can Learn from Adversarial Perturbations

Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

TL;DR

The paper gives a theoretical justification for the feature hypothesis and for perturbation learning in wide two-layer networks trained in the lazy regime. It shows that an adversarial perturbation is parallel to a weighted sum of training samples plus a small residual, and that, under three mild conditions, a classifier trained only on mislabeled adversarial examples matches the predictions of a classifier trained on clean data. Unlike prior work, which required stringent assumptions, the results hold for any data distribution. Experiments on synthetic and real datasets support the theory. The findings bear on robustness and transferability in high-dimensional settings, and indicate when a classifier trained on perturbations can generalize to clean data. Overall, the work offers a theoretical mechanism linking adversarial perturbations to dataset-wide information and class-specific features.

Abstract

Adversarial examples have raised several open questions, such as why they can deceive classifiers and transfer between different models. A prevailing hypothesis to explain these phenomena suggests that adversarial perturbations appear as random noise but contain class-specific features. This hypothesis is supported by the success of perturbation learning, where classifiers trained solely on adversarial examples and the corresponding incorrect labels generalize well to correctly labeled test data. Although this hypothesis and perturbation learning are effective in explaining intriguing properties of adversarial examples, they lack a solid theoretical foundation. In this study, we theoretically explain the counterintuitive success of perturbation learning. We assume wide two-layer networks, and the results hold for any data distribution. We prove that adversarial perturbations contain sufficient class-specific features for networks to generalize from them. Moreover, the predictions of classifiers trained on mislabeled adversarial examples coincide with those of classifiers trained on correctly labeled clean samples. The code is available at https://github.com/s-kumano/perturbation-learning.

Paper Structure

This paper contains 22 sections, 25 theorems, 119 equations, 13 figures, 1 table.

Key Result

Theorem 3.3

Let $\delta = \Theta(1)$ be a small positive number. Under Assumption \ref{asm:width}, for any $n \in [N]$, with probability at least $1 - \delta$, the adversarial perturbation $\bm{r}_n$ is parallel to a weighted sum of training samples plus a residual $\bm{\xi}_n$ that satisfies $\|\bm{\xi}_n\| = \tilde{\mathcal{O}}(1)$. The theorem also gives a special case for $\ell(s) = s$.
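
One way to build intuition for the structure stated in Theorem 3.3 is a small numerical check. The sketch below assumes a two-layer ReLU network whose first layer is trained by full-batch gradient descent while the second layer stays fixed; the sizes, learning rate, random data, and logistic loss are all hypothetical choices for illustration, not the paper's setup. Under these assumptions every weight update is a linear combination of training inputs, so the input gradient (to which the perturbation is parallel) splits into a part inherited from initialization, playing the role of $\bm{\xi}_n$, and a part that lies in the span of the training samples. The check does not reproduce the theorem's weights or its $\tilde{\mathcal{O}}(1)$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 20, 50, 2000   # hypothetical sizes: samples, input dimension, hidden width
X = rng.standard_normal((N, d))
y = rng.choice([-1.0, 1.0], size=N)

W0 = rng.standard_normal((m, d)) / np.sqrt(d)       # first layer at initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # fixed second layer

# Full-batch gradient descent on the first layer with the logistic loss
# log(1 + exp(-y f(x))), where f(x) = sum_j a_j * relu(w_j . x).
W = W0.copy()
for _ in range(300):
    pre = X @ W.T                                   # (N, m) pre-activations
    out = np.maximum(pre, 0.0) @ a                  # f(x_n) for every sample
    coef = -y / (1.0 + np.exp(y * out))             # d loss / d f
    W -= 0.5 * (((pre > 0) * coef[:, None]) * a).T @ X / N

# Input gradient of f at each training sample: sum_j a_j * 1[w_j . x_n > 0] * w_j.
gates = (X @ W.T > 0) * a                           # (N, m)
grad_x = gates @ W                                  # (N, d)
xi = gates @ W0                                     # part inherited from initialization
span_part = gates @ (W - W0)                        # part produced by training

# Least-squares check that span_part lies in span{x_1, ..., x_N}.
coeffs, *_ = np.linalg.lstsq(X.T, span_part.T, rcond=None)
rel_err = np.linalg.norm(X.T @ coeffs - span_part.T) / np.linalg.norm(span_part)

print("relative distance from the span of training samples:", rel_err)
print("mean norm of the initialization part:", np.linalg.norm(xi, axis=1).mean())
print("mean norm of the weighted-sum part:", np.linalg.norm(span_part, axis=1).mean())
```

Because `N < d`, the span of the training samples is a proper subspace, so a relative distance near machine precision is a meaningful check rather than a trivial one.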

Figures (13)

  • Figure 1: Counterintuitive generalization of perturbation learning. A classifier $g$ is trained solely on mislabeled adversarial examples $\mathcal{D}^\mathrm{adv}:=\{(\bm{x}^\mathrm{adv}_n,y^\mathrm{adv}_n)\}^N_{n=1}$. These examples $\bm{x}^\mathrm{adv}_n$ are generated to mislead a classifier $f$, which is trained on correctly labeled clean samples $\mathcal{D}:=\{(\bm{x}_n,y_n)\}^N_{n=1}$, into predicting $y^\mathrm{adv}_n$ ($\ne y_n$). Surprisingly, despite being trained only on mislabeled data, the classifier $g$ generalizes well to clean test samples. This counterintuitive result suggests that adversarial perturbations contain label-aligned class features, enabling the classifier $g$ to generalize from them.
  • Figure 2: The regions where \ref{ineq:cond-1}, \ref{ineq:cond-2-a}, and \ref{eq:cond-3-a} hold (colored areas) and their intersection.
  • Figure 3: Accuracy on the mean-shifted Gaussian dataset in Scenario (a). The blue lines represent accuracy of the classifier $f$ on $\mathcal{D}$, i.e., training accuracy. The orange lines represent accuracy of the classifier $g$ on $\mathcal{D}$.
  • Figure A4: Accuracy and agreement ratio on the mean-shifted Gaussian in Scenario (a). The blue lines represent the accuracy of the classifier $f$ on $\mathcal{D} := \{(\bm{x}_n,y_n)\}^N_{n=1}$, i.e., training accuracy. The orange lines represent the accuracy of the classifier $g$ on $\mathcal{D}$. The green lines represent the prediction agreement between $f$ and $g$ on the test dataset.
  • Figure A5: Accuracy and agreement ratio on the mean-shifted Gaussian in Scenario (b). The description is the same as in \ref{fig:shifted-gauss-a}.
  • ...and 8 more figures
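
The procedure described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is at the repository linked in the abstract). The assumptions are: a mean-shifted Gaussian dataset with hypothetical sizes, a two-layer ReLU network of moderate width with only the first layer trained, the loss $-y f(\bm{x})$ (that is, $\ell(s) = s$), a single normalized gradient step as the attack, and label flipping $y^\mathrm{adv}_n = -y_n$. The perturbation norm `eps` is deliberately chosen larger than the class-mean shift so that the perturbation feature outweighs the clean feature; all helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 200, 200, 1000      # hypothetical: samples, input dimension, hidden width
shift, eps = 3.0, 8.0         # hypothetical: class-mean shift, perturbation norm
mu = np.zeros(d)
mu[0] = shift

def sample(n):
    """Mean-shifted Gaussian: x = y * mu + standard normal noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    return y[:, None] * mu + rng.standard_normal((n, d)), y

def predict(W, a, X):
    """Two-layer ReLU network f(x) = sum_j a_j * relu(w_j . x)."""
    return np.maximum(X @ W.T, 0.0) @ a

def train(X, y, steps=20, lr=0.3):
    """Gradient descent on the first layer for the loss -mean(y * f(x))."""
    W = rng.standard_normal((m, d)) / np.sqrt(d)        # trained first layer
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # fixed second layer
    for _ in range(steps):
        active = X @ W.T > 0                            # (n, m) ReLU gates
        grad = -((active * y[:, None]) * a).T @ X / len(y)
        W -= lr * grad
    return W, a

def input_grad(W, a, X):
    """Gradient of f with respect to each input row."""
    return ((X @ W.T > 0) * a) @ W

def accuracy(W, a, X, y):
    return float(np.mean(np.sign(predict(W, a, X)) == y))

X, y = sample(N)
X_test, y_test = sample(1000)

# 1) f is trained on correctly labeled clean samples D.
Wf, af = train(X, y)

# 2) Adversarial examples push f toward the wrong label y_adv = -y.
y_adv = -y
step = y_adv[:, None] * input_grad(Wf, af, X)
X_adv = X + eps * step / np.linalg.norm(step, axis=1, keepdims=True)

# 3) g is trained only on the mislabeled adversarial examples D_adv.
Wg, ag = train(X_adv, y_adv)

# 4) Both classifiers are evaluated on clean, correctly labeled test samples.
acc_f = accuracy(Wf, af, X_test, y_test)
acc_g = accuracy(Wg, ag, X_test, y_test)
fooled = float(np.mean(np.sign(predict(Wf, af, X_adv)) == y_adv))
print("f on clean test data:", acc_f)
print("g on clean test data:", acc_g)
print("fraction of adversarial examples that f labels as y_adv:", fooled)
```

In this sketch `g` never sees a correctly labeled sample, so any accuracy it reaches on clean test data has to come from the perturbations. Shrinking `eps` toward the class-mean shift is a simple way to explore when that stops working in this toy setting.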

Theorems & Definitions (45)

  • Theorem 3.3: Direction of adversarial perturbation
  • Theorem 3.4: Perturbation learning, Scenario (a), special case of \ref{th:PL-a-general}
  • Theorem 3.5: Perturbation learning, Scenario (b), special case of \ref{th:PL-b-general}
  • Lemma C.1: Properties of Gaussian random variables
  • proof
  • Lemma C.2: Hoeffding's inequality
  • proof
  • Lemma C.3: Expectation of product of derivatives of activation functions, part 1
  • proof
  • Lemma C.4: Expectation of product of derivatives of activation functions, part 2
  • ...and 35 more