Feature Averaging: An Implicit Bias of Gradient Descent Leading to Non-Robustness in Neural Networks

Binghui Li, Zhixuan Pan, Kaifeng Lyu, Jian Li

TL;DR

The paper identifies feature averaging as a key implicit bias of gradient descent in training two-layer ReLU networks on multi-cluster data, showing that weights converge toward averages of cluster centers and that this leads to non-robustness with a robustness radius on the order of $O(\sqrt{d/k})$. It proves finite-time convergence to the feature-averaging solution and demonstrates that more granular supervision (labeling individual features or clusters) enables gradient descent to learn decoupled, robust features with a robustness radius of $O(\sqrt{d})$. Empirical results on synthetic data, MNIST, CIFAR-10, and transfer scenarios corroborate the theory, illustrating the robustness gains from fine-grained supervision and the prevalence of feature averaging under standard training. The findings highlight a fundamental optimization-driven limitation of gradient-based learning for robustness and suggest practical avenues to enhance robustness via supervisory granularity. Overall, the work contributes a rigorous link between optimization bias, feature learning dynamics, and adversarial robustness with actionable implications for designing robust models.

Abstract

In this work, we investigate a particular implicit bias in gradient descent training, which we term "Feature Averaging," and argue that it is one of the principal factors contributing to the non-robustness of deep neural networks. We show that, even when multiple discriminative features are present in the input data, neural networks trained by gradient descent tend to rely on an average (or a certain combination) of these features for classification, rather than distinguishing and leveraging each feature individually. Specifically, we provide a detailed theoretical analysis of the training dynamics of two-layer ReLU networks on a binary classification task, where the data distribution consists of multiple clusters with mutually orthogonal centers. We rigorously prove that gradient descent biases the network towards feature averaging, where the weights of each hidden neuron represent an average of the cluster centers (each corresponding to a distinct feature), thereby making the network vulnerable to input perturbations aligned with the negative direction of the averaged features. On the positive side, we demonstrate that this vulnerability can be mitigated through more granular supervision. In particular, we prove that a two-layer ReLU network can achieve optimal robustness when trained to classify individual features rather than merely the original binary classes. Finally, we validate our theoretical findings with experiments on synthetic datasets, MNIST, and CIFAR-10, and confirm the prevalence of feature averaging and its impact on adversarial robustness. We hope these theoretical and empirical insights deepen the understanding of how gradient descent shapes feature learning and adversarial robustness, and how more detailed supervision can enhance robustness.
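
The vulnerability described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's construction or code: it assumes noise-free cluster centers $\boldsymbol{\mu}_j = \sqrt{d}\,\boldsymbol{e}_j$, an equal split of the $k$ clusters between the two classes, and an idealized two-neuron feature-averaging network whose weights are exactly the class-wise means of the centers. The helper names (`make_centers`, `f_avg`, `flip_radius`) are hypothetical. The sketch searches for the smallest perturbation along the negative averaged direction that flips the prediction.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def make_centers(d, k):
    """k mutually orthogonal cluster centers, each of norm sqrt(d) (assumed scaling)."""
    return np.sqrt(d) * np.eye(d)[:k]          # row j is mu_j

def f_avg(x, mu, pos, neg):
    """Idealized feature-averaging network (an assumption of this sketch):
    one neuron per class, weighted by the mean of that class's centers."""
    w_pos = mu[pos].mean(axis=0)
    w_neg = mu[neg].mean(axis=0)
    return relu(x @ w_pos) - relu(x @ w_neg)

def flip_radius(x, mu, pos, neg, radii):
    """Smallest radius in `radii` at which moving x along the negative averaged
    direction (w_neg - w_pos) makes f_avg negative; None if no flip occurs."""
    direction = mu[neg].mean(axis=0) - mu[pos].mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    for r in radii:
        if f_avg(x + r * direction, mu, pos, neg) < 0:
            return r
    return None

d, k = 400, 16
mu = make_centers(d, k)
pos, neg = np.arange(k // 2), np.arange(k // 2, k)   # index sets J_+ and J_-
x = mu[0]                                            # noise-free point of a positive cluster
radii = np.linspace(0.0, np.sqrt(d), 2001)           # candidate perturbation norms
r_flip = flip_radius(x, mu, pos, neg, radii)

# Compare the measured flip radius with sqrt(d/k) and with sqrt(d/2),
# which is half the distance between two orthogonal centers.
print(r_flip, np.sqrt(d / k), np.sqrt(d / 2))
```

Under these assumptions the flip happens at a radius close to $\sqrt{d/k}$ (5 for the values above), whereas two distinct centers are $\sqrt{2d}$ apart, so a perturbation of norm below $\sqrt{d/2}$ cannot move a center closer to an opposite-class center than to its own. This gap between the $\sqrt{d/k}$ and $\sqrt{d}$ scales is what the abstract attributes to feature averaging.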

Paper Structure

This paper contains 44 sections, 72 theorems, 264 equations, 14 figures.

Key Result

Theorem 4.5

In the setting of training a two-layer ReLU network on the binary classification problem $\mathcal{D}(\{\boldsymbol{\mu}_j\}_{j=1}^{k}, J_{\pm})$ as described in the setup section, under the paper's feature, balance, and hyper-parameter assumptions, for some $\gamma = o(1)$, after $\Omega(\eta^{-1}) \le T \le \exp(\tild

Figures (14)

  • Figure 1: Schematic illustration of feature averaging and feature decoupling: We consider a dataset with $5$ clusters. The first three clusters belong to $J_{+}$, and the other two to $J_{-}$. Denote $\boldsymbol{\mu}_{+}:={(\boldsymbol{\mu}_1+\boldsymbol{\mu}_2+\boldsymbol{\mu}_3)}/{3}, \boldsymbol{\mu}_{-}:={(\boldsymbol{\mu}_4+\boldsymbol{\mu}_5)}/{2}$. For ease of illustration, we assume that $\sum_{j=1}^{5}\boldsymbol{\mu}_j = \boldsymbol{0}$. The feature-averaging classifier $f_{\mathrm{FA}}$ uses two neurons with averaged features to classify all data, which corresponds to a linear classifier (the gray line). The feature-decoupling classifier $f_{\mathrm{FD}}$ uses individual features and has a more complex polyhedral decision boundary (green lines). Note that the instance is high-dimensional and this is only a schematic illustration. The distance between data points and the decision boundary of $f_{\mathrm{FD}}$ (green lines) is much larger than that of $f_{\mathrm{FA}}$ (gray line), which implies that the feature-decoupling classifier is more robust than the feature-averaging one.
  • Figure 2: Illustration of feature averaging and feature decoupling on the synthetic dataset (a, b) and the CIFAR-10 dataset (c, d). Panels (a) and (c) correspond to models trained using 2-class labels, and panels (b) and (d) correspond to models trained using 10-class labels. Each element of the matrix, located at position $(i,j)$, represents the average cosine of the angle between the feature vector $\boldsymbol{\mu}_i$ of the $i$-th feature and the equivalent weight vector $\boldsymbol{w}_j$ of $f_j(\cdot)$.
  • Figure 3: Verifying robustness improvement: We compare the adversarial robustness of a model trained with 2-class labels (red line) and a model trained with 10-class labels (blue line) on synthetic data (left), MNIST (middle), and CIFAR-10 (right).
  • Figure 4: A schematic illustration of the construction in li2022robust: The positive class consists of the blue points and the negative class of the red points. In their lower bound, there are in fact exponentially many blue points slightly above the hyperplane and exponentially many red ones slightly below it. The hyperplane has perfect clean accuracy but is non-robust, while a more robust classifier exists (one that separates the blue balls from the red balls). One can observe the conceptual similarity with Figure 1.
  • Figure 5: Illustration of feature averaging on the synthetic dataset when varying the number of samples $n$.
  • ...and 9 more figures
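
The cosine matrices described in the Figure 2 caption can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code: the weights are idealized stand-ins (class-wise means of the centers for the 2-class case, the centers themselves for the decoupled case) instead of trained weights, the averaging mentioned in the caption is omitted, and `cosine_matrix` is a hypothetical helper name.

```python
import numpy as np

def cosine_matrix(features, weights):
    """Entry (i, j) is the cosine of the angle between features[i] and weights[j]."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    return f @ w.T

d, k = 50, 10
mu = np.sqrt(d) * np.eye(d)[:k]          # assumed orthogonal features, row i is mu_i

# Idealized feature-averaging weights (assumption): one vector per binary class.
w_avg = np.stack([mu[: k // 2].mean(axis=0), mu[k // 2 :].mean(axis=0)])
# Idealized feature-decoupling weights (assumption): one vector per feature.
w_dec = mu.copy()

C_avg = cosine_matrix(mu, w_avg)         # shape (k, 2): block pattern
C_dec = cosine_matrix(mu, w_dec)         # shape (k, k): diagonal pattern
print(np.round(C_avg, 3))
print(np.round(C_dec, 3))
```

With these idealized weights, every feature has cosine $1/\sqrt{5} \approx 0.447$ with its own class's averaged weight and $0$ with the other, while the decoupled matrix is the identity. Trained networks would only approximate these patterns.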

Theorems & Definitions (132)

  • Definition 3.1: Multi-Cluster Data Distribution
  • Definition 4.1: Feature-Averaging Network
  • Remark 4.2
  • Remark 4.4: Discussion of Hyper-Parameter Choices
  • Theorem 4.5
  • Theorem 4.6: Conjecture 1 from min2024can
  • Theorem 4.7
  • Proposition 4.8
  • Lemma C.1: Weight Decomposition
  • Theorem C.2: Restatement of Theorem 4.5
  • ...and 122 more