Can Implicit Bias Imply Adversarial Robustness?
Hancheng Min, René Vidal
TL;DR
This work studies why shallow networks trained by gradient flow are vulnerable to adversarial perturbations when the data contain latent subclasses. Contrasting the vanilla ReLU (p=1) with a generalized polynomial ReLU (pReLU, p≥2), it shows that, depending on the activation, the implicit bias of gradient flow aligns neurons either with average class centers (harmful to robustness) or with subclass centers (robust, under certain data structures). The authors provide theoretical results and conjectures linking activation choice, neuron alignment, and robustness, complemented by numerical experiments on synthetic data and real datasets (e.g., MNIST parity and Caltech256 with a pre-trained feature extractor). The key finding is that a suitably designed pReLU activation can yield $O(1)$-robustness against adversarial attacks in shallow networks, highlighting the interplay between data geometry and architectural inductive biases.
Abstract
The implicit bias of gradient-based training algorithms has been considered mostly beneficial as it leads to trained networks that often generalize well. However, Frei et al. (2023) show that such implicit bias can harm adversarial robustness. Specifically, they show that if the data consists of clusters with small inter-cluster correlation, a shallow (two-layer) ReLU network trained by gradient flow generalizes well, but it is not robust to adversarial attacks of small radius. Moreover, this phenomenon occurs despite the existence of a much more robust classifier that can be explicitly constructed from a shallow network. In this paper, we extend recent analyses of neuron alignment to show that a shallow network with a polynomial ReLU activation (pReLU) trained by gradient flow not only generalizes well but is also robust to adversarial attacks. Our results highlight the importance of the interplay between data structure and architecture design in the implicit bias and robustness of trained networks.
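To make the architectural contrast concrete, the following is a minimal sketch of a shallow network whose activation is the ReLU raised to a power p. The text above does not state the exact pReLU formula, so the form `max(z, 0) ** p` is an assumption, and the paper's definition may include additional terms such as a weight-norm normalization. The names `poly_relu` and `shallow_net` are hypothetical and chosen only for illustration.

```python
import numpy as np


def poly_relu(z, p=1):
    """Assumed polynomial ReLU: max(z, 0) ** p. With p=1 this is the ordinary ReLU."""
    return np.maximum(z, 0.0) ** p


def shallow_net(x, W, v, p=1):
    """Two-layer network f(x) = sum_j v_j * poly_relu(<w_j, x>, p).

    x: input of shape (d,)
    W: first-layer weights of shape (h, d), one neuron per row
    v: second-layer weights of shape (h,)
    """
    return float(v @ poly_relu(W @ x, p))


if __name__ == "__main__":
    # Two neurons aligned with two orthogonal directions (illustrative values only).
    W = np.eye(2)
    v = np.array([1.0, 1.0])
    x = np.array([2.0, -1.0])
    print(shallow_net(x, W, v, p=1))  # ordinary ReLU network
    print(shallow_net(x, W, v, p=3))  # polynomial ReLU with p = 3
```

With p > 1, a neuron's output grows faster for inputs strongly aligned with its weight vector and shrinks for weakly aligned ones. This sketch only illustrates the forward pass; it does not reproduce the gradient-flow training or the robustness analysis.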
