Theoretical Understanding of Learning from Adversarial Perturbations
Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
TL;DR
The paper addresses why adversarial perturbations can deceive classifiers and transfer across models. It introduces a theoretical framework based on a one-hidden-layer network trained on mutually orthogonal samples. Within this framework, perturbations act as class features: the learned decision boundary decomposes into contributions from mislabeled perturbations and geometry-inspired perturbations, and the paper proves that, under mild conditions, this boundary matches the one learned from clean data except in specific regions. Experiments on artificial data and on real datasets (MNIST, Fashion-MNIST, CIFAR-10) support the theory, showing strong boundary alignment and generalization even when the training data are perturbed or randomly labeled. These findings give a theoretical justification for the feature hypothesis and offer insight into the behavior of adversarial examples and their transferability across models.
Abstract
It is not fully understood why adversarial examples can deceive neural networks and transfer between different networks. To elucidate this, several studies have hypothesized that adversarial perturbations, while appearing as noise, contain class features. This is supported by empirical evidence showing that networks trained on mislabeled adversarial examples can still generalize well to correctly labeled test samples. However, theoretical understanding of how perturbations include class features and contribute to generalization is limited. In this study, we provide a theoretical framework for understanding learning from perturbations using a one-hidden-layer network trained on mutually orthogonal samples. Our results highlight that various adversarial perturbations, even perturbations of a few pixels, contain sufficient class features for generalization. Moreover, we reveal that, under mild conditions, the decision boundary obtained when learning from perturbations matches that obtained from standard samples except in specific regions. The code is available at https://github.com/s-kumano/learning-from-adversarial-perturbations.
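The "learning from perturbations" procedure described in the abstract can be illustrated with a small toy sketch. It is not the authors' code (see the repository linked above for that) and it makes several simplifying assumptions: a linear logistic classifier stands in for the one-hidden-layer network, the clean data are two synthetic Gaussian classes, the base samples are pure Gaussian noise with randomly assigned target labels, and the perturbation is an L2-bounded step of hypothetical size `eps` along the teacher's weight vector (the targeted adversarial direction for a linear model). All names (`train_logistic`, `w_teacher`, `w_student`, `eps`) are illustrative.

```python
import numpy as np


def train_logistic(X, y, steps=300, lr=0.1):
    """Fit a linear classifier sign(w @ x) by gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # sigmoid(-m) = 1 / (1 + exp(m)), written with tanh for numerical stability
        coeff = 0.5 * (1.0 - np.tanh(margins / 2.0))
        grad = -(X * (coeff * y)[:, None]).mean(axis=0)
        w -= lr * grad
    return w


def accuracy(w, X, y):
    return float(np.mean(np.sign(X @ w) == y))


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d, n = 50, 500

# Toy clean data (an assumption of this sketch): x = y * mu + Gaussian noise.
mu = np.zeros(d)
mu[:5] = 3.0 / np.sqrt(5.0)  # class-mean vector with norm 3


def sample_clean(num):
    y = rng.choice([-1.0, 1.0], size=num)
    X = y[:, None] * mu + rng.normal(size=(num, d))
    return X, y


X_train, y_train = sample_clean(n)
X_test, y_test = sample_clean(2000)

# Step 1: train a teacher classifier on clean, correctly labeled data.
w_teacher = train_logistic(X_train, y_train)

# Step 2: take base samples that carry no class information, assign random
# target labels, and perturb each sample toward its target class. For a linear
# model the L2 targeted adversarial direction is y_target * w / ||w||.
X_base = rng.normal(size=(n, d))
y_target = rng.choice([-1.0, 1.0], size=n)
eps = 1.0  # hypothetical perturbation budget, small relative to ||x|| ~ sqrt(d)
delta = eps * y_target[:, None] * w_teacher / np.linalg.norm(w_teacher)
X_adv = X_base + delta

# Step 3: train a student only on the perturbed samples and their target labels.
w_student = train_logistic(X_adv, y_target)

# Step 4: evaluate the student on clean test data it has never seen.
student_clean_acc = accuracy(w_student, X_test, y_test)
boundary_cosine = cosine(w_student, w_teacher)

# Control: the same random labels without perturbations carry no class signal.
w_control = train_logistic(X_base, y_target)
control_clean_acc = accuracy(w_control, X_test, y_test)

print("student accuracy on clean test data:", student_clean_acc)
print("cosine(student, teacher):", boundary_cosine)
print("control accuracy (no perturbation):", control_clean_acc)
```

In this toy setting the only class-relevant signal the student sees is the perturbation itself, so any generalization to the clean test set has to come from features carried by the perturbations. The control run, which uses the same random labels without perturbations, is included for comparison. The linear model keeps the sketch short; it does not reproduce the paper's one-hidden-layer analysis or its orthogonality assumption.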
