Wide Two-Layer Networks can Learn from Adversarial Perturbations
Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
TL;DR
The paper gives a theoretical justification for the feature hypothesis and for perturbation learning in wide two-layer networks trained in the lazy regime. It shows that adversarial perturbations are approximately parallel to a weighted sum of training samples, up to a small residual, so they carry class-specific information drawn from the whole dataset. It further shows that, under three mild conditions, a classifier trained only on perturbed, mislabeled data can match the predictions of a classifier trained on clean data. Unlike prior work, which required stringent assumptions, these results do not depend on the data distribution. Experiments on synthetic and real datasets illustrate the theory. The findings bear on robustness and on the transfer of adversarial examples between models in high-dimensional settings, and indicate when perturbation-based learning can generalize to clean data.
Abstract
Adversarial examples have raised several open questions, such as why they can deceive classifiers and why they transfer between different models. A prevailing hypothesis to explain these phenomena is that adversarial perturbations look like random noise but contain class-specific features. This hypothesis is supported by the success of perturbation learning, in which classifiers trained solely on adversarial examples and the corresponding incorrect labels generalize well to correctly labeled test data. Although the hypothesis and perturbation learning help explain intriguing properties of adversarial examples, their theoretical foundation remains limited. In this study, we theoretically explain the counterintuitive success of perturbation learning. Our analysis assumes wide two-layer networks, and the results hold for any data distribution. We prove that adversarial perturbations contain sufficient class-specific features for networks to generalize from them. Moreover, the predictions of classifiers trained on mislabeled adversarial examples coincide with those of classifiers trained on correctly labeled clean samples. The code is available at https://github.com/s-kumano/perturbation-learning.
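
The perturbation-learning procedure described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation (see their repository for that). The following are assumptions made for the example and are not taken from the paper: binary labels in {-1, +1}, a ReLU two-layer network with a fixed second layer, logistic loss, a single L2-normalized gradient step as the perturbation, uniformly random target labels, and all function names and hyperparameters.

```python
import numpy as np

def init_net(d, m, rng):
    W = rng.normal(size=(m, d))            # trainable first-layer weights
    a = rng.choice([-1.0, 1.0], size=m)    # fixed second-layer signs (assumption)
    return W, a

def forward(W, a, X):
    # f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(len(a))

def input_grad(W, a, X):
    # Gradient of f with respect to the input x, one row per sample.
    act = (X @ W.T > 0).astype(float)
    return (act * a) @ W / np.sqrt(len(a))

def train(W, a, X, y, lr=0.1, steps=300):
    # Full-batch gradient descent on the first layer with logistic loss.
    W = W.copy()
    n, m = len(y), len(a)
    for _ in range(steps):
        pre = X @ W.T
        out = np.maximum(pre, 0.0) @ a / np.sqrt(m)
        g_out = -y / (1.0 + np.exp(y * out)) / n       # dLoss/dout per sample
        G = ((pre > 0) * a * g_out[:, None]).T @ X / np.sqrt(m)
        W -= lr * G
    return W

def make_perturbed_dataset(W, a, X, eps, rng):
    # Target labels are drawn at random, independently of the true labels.
    y_adv = rng.choice([-1.0, 1.0], size=len(X))
    g = input_grad(W, a, X)
    unit = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    # One step of size eps toward the target label for each sample.
    return X + eps * y_adv[:, None] * unit, y_adv

rng = np.random.default_rng(0)
d, m, n, eps = 10, 256, 200, 0.5
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])                       # synthetic clean labels
X_test = rng.normal(size=(n, d))

# 1) Train a classifier on clean, correctly labeled data.
W0, a0 = init_net(d, m, rng)
W_clean = train(W0, a0, X, y)

# 2) Build adversarial examples paired with their (incorrect) target labels.
X_adv, y_adv = make_perturbed_dataset(W_clean, a0, X, eps, rng)

# 3) Train a fresh classifier on the perturbed, mislabeled data only.
W1, a1 = init_net(d, m, rng)
W_pert = train(W1, a1, X_adv, y_adv)

# 4) Compare the two classifiers' predictions on clean test inputs.
agreement = np.mean(
    np.sign(forward(W_clean, a0, X_test)) == np.sign(forward(W_pert, a1, X_test))
)
print(f"prediction agreement on clean test inputs: {agreement:.2f}")
```

The printed agreement depends on the width, the perturbation size, and the training budget, so this toy setup should not be read as reproducing the paper's results. It only makes the data flow concrete: the second classifier never sees a clean sample or a correct label.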
