Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks
Binghui Li, Zhixuan Pan, Kaifeng Lyu, Jian Li
TL;DR
The paper identifies feature averaging as a key implicit bias of gradient descent in training two-layer ReLU networks on multi-cluster data, showing that weights converge toward averages of cluster centers and that this leads to non-robustness with a robustness radius on the order of ${\rm O}(\sqrt{d/k})$. It proves finite-time convergence to the feature-averaging solution and demonstrates that more granular supervision (labeling individual features or clusters) enables gradient descent to learn decoupled, robust features with a robustness radius ${\rm O}(\sqrt{d})$. Empirical results on synthetic data, MNIST, CIFAR-10, and transfer scenarios corroborate the theory, illustrating the robustness gains from fine-grained supervision and the prevalence of feature averaging under standard training. The findings highlight a fundamental optimization-driven limitation of gradient-based learning for robustness and suggest practical avenues to enhance robustness via supervisory granularity. Overall, the work contributes a rigorous link between optimization bias, feature learning dynamics, and adversarial robustness with actionable implications for designing robust models.
Abstract
In this work, we investigate a particular implicit bias in gradient descent training, which we term "Feature Averaging," and argue that it is one of the principal factors contributing to the non-robustness of deep neural networks. We show that, even when multiple discriminative features are present in the input data, neural networks trained by gradient descent tend to rely on an average (or a certain combination) of these features for classification, rather than distinguishing and leveraging each feature individually. Specifically, we provide a detailed theoretical analysis of the training dynamics of two-layer ReLU networks on a binary classification task, where the data distribution consists of multiple clusters with mutually orthogonal centers. We rigorously prove that gradient descent biases the network towards feature averaging, where the weights of each hidden neuron represent an average of the cluster centers (each corresponding to a distinct feature), thereby making the network vulnerable to input perturbations aligned with the negative direction of the averaged features. On the positive side, we demonstrate that this vulnerability can be mitigated through more granular supervision. In particular, we prove that a two-layer ReLU network can achieve optimal robustness when trained to classify individual features rather than merely the original binary classes. Finally, we validate our theoretical findings with experiments on synthetic datasets, MNIST, and CIFAR-10, and confirm the prevalence of feature averaging and its impact on adversarial robustness. We hope these theoretical and empirical insights deepen the understanding of how gradient descent shapes feature learning and adversarial robustness, and how more detailed supervision can enhance robustness.
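To make the contrast between averaged and decoupled features concrete, here is a minimal NumPy sketch, not the paper's construction or experiments. It assumes $d$ is the input dimension and $k$ the number of clusters, places the $k$ orthogonal cluster centers on scaled coordinate axes with norm $\sqrt{d}$, and uses a noise-free center as the test input. The hand-built networks, the bias `d / 2` in the decoupled network, and the helper `flip_radius` are all illustrative assumptions. The probe searches along a single direction per network, so it only gives an upper bound on each network's true robustness radius.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Assumed toy setup: k orthogonal cluster centers in R^d, each of norm sqrt(d).
d, k = 256, 64
centers = np.sqrt(d) * np.eye(d)[:k]
pos, neg = centers[: k // 2], centers[k // 2:]  # positive / negative class clusters

# Feature-averaging network: one neuron per class, whose weight is the
# average of that class's cluster centers.
w_pos, w_neg = pos.mean(axis=0), neg.mean(axis=0)

def averaging_net(x):
    return relu(w_pos @ x) - relu(w_neg @ x)

# Feature-decoupled network: one neuron per cluster center.
# The bias d/2 is a hand-picked assumption for this sketch.
def decoupled_net(x, bias=d / 2):
    return relu(pos @ x - bias).sum() - relu(neg @ x - bias).sum()

def flip_radius(net, x, direction, hi=100.0, iters=60):
    """Bisect for the smallest step along `direction` at which net(x) <= 0.

    Assumes net(x) > 0, net(x + hi * u) <= 0, and that the output is
    non-increasing along the ray, which holds for the cases below.
    """
    u = direction / np.linalg.norm(direction)
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if net(x + mid * u) <= 0:
            hi = mid
        else:
            lo = mid
    return hi

x = centers[0]  # noise-free input at a positive cluster center

# Attack the averaging network along the negative averaged-feature direction.
r_avg = flip_radius(averaging_net, x, -(w_pos - w_neg))
# Attack the decoupled network by shrinking the single feature x relies on.
r_dec = flip_radius(decoupled_net, x, -x)

print(f"averaging net flips at radius {r_avg:.3f}, sqrt(d/k) = {np.sqrt(d / k):.3f}")
print(f"decoupled net flips at radius {r_dec:.3f}, sqrt(d)/2 = {np.sqrt(d) / 2:.3f}")
```

In this toy setting the averaging network flips at a radius of about $\sqrt{d/k}$, while the decoupled network needs a step of about $\sqrt{d}/2$ along the probed direction, matching the scaling gap described above.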
