Can Implicit Bias Imply Adversarial Robustness?

Hancheng Min, René Vidal

TL;DR

The work examines why shallow networks trained by gradient flow are vulnerable to adversarial perturbations when the data contain latent subclasses. By contrasting the vanilla ReLU (p=1) with a generalized polynomial ReLU (p≥2), it shows that the implicit bias of gradient flow aligns neurons either with the average class centers (harmful to robustness) or with the subclass centers (robust, under certain data structures). The authors provide theoretical results and conjectures linking activation choice, neuron alignment, and robustness, complemented by numerical experiments on synthetic data and real datasets (e.g., MNIST parity and Caltech256 with a pre-trained feature extractor). The key finding is that a suitably designed pReLU activation can yield $O(1)$-robustness against adversarial attacks in shallow networks, which highlights the interplay between data geometry and architectural inductive bias in determining robustness.

Abstract

The implicit bias of gradient-based training algorithms has been considered mostly beneficial as it leads to trained networks that often generalize well. However, Frei et al. (2023) show that such implicit bias can harm adversarial robustness. Specifically, they show that if the data consists of clusters with small inter-cluster correlation, a shallow (two-layer) ReLU network trained by gradient flow generalizes well, but it is not robust to adversarial attacks of small radius. Moreover, this phenomenon occurs despite the existence of a much more robust classifier that can be explicitly constructed from a shallow network. In this paper, we extend recent analyses of neuron alignment to show that a shallow network with a polynomial ReLU activation (pReLU) trained by gradient flow not only generalizes well but is also robust to adversarial attacks. Our results highlight the importance of the interplay between data structure and architecture design in the implicit bias and robustness of trained networks.
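
To make the contrast between $p=1$ and $p\geq 2$ concrete, the following is a minimal sketch of a shallow pReLU network. This summary does not state the exact activation, so the form used here is an assumption: each neuron outputs the ReLU raised to the $p$-th power, normalized by $\|\boldsymbol{w}_j\|^{p-1}$, which reduces to a vanilla ReLU network when $p=1$. The function name `prelu_network` and its arguments are hypothetical.

```python
import numpy as np

def prelu_network(x, W, v, p=1):
    """Shallow network with a polynomial ReLU activation (assumed form).

    x: (d,) input, W: (h, d) hidden weights, v: (h,) output weights.
    Assumption: neuron j computes relu(<w_j, x>)**p / ||w_j||**(p-1),
    so p = 1 recovers the vanilla two-layer ReLU network.
    """
    pre = W @ x                                   # pre-activations <w_j, x>
    norms = np.linalg.norm(W, axis=1)             # ||w_j|| per neuron
    safe_norms = np.maximum(norms, 1e-12)         # guard against zero rows
    act = np.maximum(pre, 0.0) ** p / safe_norms ** (p - 1)
    return float(v @ act)
```

Under this assumed form, a neuron's response to an input $\boldsymbol{x}$ equals $\|\boldsymbol{w}_j\|\,\|\boldsymbol{x}\|^p \cos^p(\boldsymbol{w}_j,\boldsymbol{x})$ whenever the cosine is positive, so larger $p$ suppresses inputs that are only weakly aligned with the neuron.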

Paper Structure

This paper contains 32 sections, 14 theorems, 84 equations, 7 figures, 1 table, and 1 algorithm.

Key Result

Proposition 1

Given a test sample $(\boldsymbol{x},y)\sim \mathcal{D}_{X,Y}$, and classifiers $F$ and $F^{(p)}$, $p\geq 1$, defined in eq_f0 and eq_fp, resp., there is a constant $C$ such that

Figures (7)

  • Figure 1: Visualizing the training of a pReLU network under small initialization (details are explained in later sections). The dataset has its positive class sampled from two subclasses. (a) At initialization, all neurons have small norms and point in random directions. When $p=1$ (vanilla ReLU network): (b) during the alignment phase, the neuron directions align with one of the average class centers $\bar{\boldsymbol{\mu}}_+$ and $\bar{\boldsymbol{\mu}}_-$; (c) during the second phase, neurons keep their alignment with $\bar{\boldsymbol{\mu}}_+$ and $\bar{\boldsymbol{\mu}}_-$ while growing their norms. When $p=3$: (d) neurons learn the subclass centers during the alignment phase and (e) keep this alignment in the second phase. Note: neurons pointing in directions other than class/subclass centers are not activated by any data point and keep small norms throughout training.
  • Figure 2: Numerical experiment ($K=10,K_1=6$) validating Conjecture conj_conv. (a) We train pReLU networks using SGD with small initialization, then estimate the distance $\text{dist}(f_p,F)$ between the trained network $f_p$ and the classifier $F$ when $p=1$ (top plot); when $p=3$, we estimate $\text{dist}(f_p,F^{(p)})$ instead (bottom plot). The training is done under different choices of intra-subclass variance $\alpha$ and repeated 10 times per $\alpha$; the solid line shows the average and the shaded area denotes the region between the max and min values. (b) Given a trained network obtained from one instance of this training ($\alpha=0.1$), we reorder the neurons by their contributions $|v_j|\|\boldsymbol{w}_j\|$ and show the contributions in a bar plot. (c)(d) For neurons with large contributions, we plot a colormap in which each pixel represents some $\cos(\boldsymbol{w}_j,\boldsymbol{\mu})$, where $\boldsymbol{\mu}$ is either one of the average class centers $\bar{\boldsymbol{\mu}}_+$ and $\bar{\boldsymbol{\mu}}_-$ or one of the subclass centers $\boldsymbol{\mu}_k,k\in[K]$. Note: for visibility, the neurons are reordered again so that neurons aligned with the same $\boldsymbol{\mu}$ are grouped together. (e) Lastly, we carry out an $l_2$ PGD attack on a test dataset and plot the robust accuracy of the trained network under different choices of attack radius.
  • Figure 3: Parity prediction on the MNIST dataset with pReLU networks. (a) We plot the data correlation as a colormap, where each pixel represents some $\cos\left( x_i,x_j\right)$ between two centered data points $x_i,x_j$ from the MNIST training dataset. (b) We run Adam with batch size $1000$ to train a pReLU network under Kaiming initialization (repeated 10 times), then plot the training/testing accuracy during training for different choices of $p$ (the shaded region indicates the range between the minimum and maximum values over 10 randomized runs). (c) We stack the hidden post-activation representation of each training sample into a matrix, compute its stable rank, and plot the evolution of this stable rank during training. (d) After training for $50$ epochs, we carry out an APGD $l_\infty$-attack on the MNIST test dataset (in pixel space) and plot the robust accuracy of the trained pReLU network under different choices of attack radius.
  • Figure 4: Classification on Caltech256 dataset (relabeled into 10 superclasses) with a pre-trained ResNet152 as a fixed feature extractor.
  • Figure 5: Additional numerical experiment ($K=20,K_1=8$) validating Conjecture conj_conv. (a) We train pReLU networks using SGD with small initialization, then estimate the distance $\text{dist}(f_p,F)$ between the trained network $f_p$ and the classifier $F$ when $p=1$ (top plot); when $p=3$, we estimate $\text{dist}(f_p,F^{(p)})$ instead (bottom plot). The training is done under different choices of intra-subclass variance $\alpha$ and repeated 10 times per $\alpha$; the solid line shows the average and the shaded area denotes the region between the max and min values. (b) Given a trained network obtained from one instance of this training ($\alpha=0.1$), we reorder the neurons by their contributions $|v_j|\|\boldsymbol{w}_j\|$ and show the contributions in a bar plot. (c)(d) For neurons with large contributions, we plot a colormap in which each pixel represents some $\cos(\boldsymbol{w}_j,\boldsymbol{\mu})$, where $\boldsymbol{\mu}$ is either one of the average class centers $\bar{\boldsymbol{\mu}}_+$ and $\bar{\boldsymbol{\mu}}_-$ or one of the subclass centers $\boldsymbol{\mu}_k,k\in[K]$. Note: for visibility, the neurons are reordered again so that neurons aligned with the same $\boldsymbol{\mu}$ are grouped together. (e) Lastly, we carry out an $l_2$ PGD attack on a test dataset and plot the robust accuracy of the trained network under different choices of attack radius.
  • ...and 2 more figures
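
The diagnostics named in the captions above (neuron contributions $|v_j|\|\boldsymbol{w}_j\|$, the cosine colormap $\cos(\boldsymbol{w}_j,\boldsymbol{\mu})$, and the stable rank of the hidden representation) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the stable rank is assumed to follow the usual definition $\|H\|_F^2/\|H\|_2^2$, which the captions do not spell out.

```python
import numpy as np

def neuron_contributions(W, v):
    """Per-neuron contribution |v_j| * ||w_j||; W is (h, d), v is (h,)."""
    return np.abs(v) * np.linalg.norm(W, axis=1)

def cosine_alignment(W, centers):
    """Matrix of cos(w_j, mu_k); W is (h, d), centers is (K, d).

    Entry (j, k) is one pixel of the colormap described in the captions.
    """
    Wn = W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    Cn = centers / np.maximum(
        np.linalg.norm(centers, axis=1, keepdims=True), 1e-12
    )
    return Wn @ Cn.T

def stable_rank(H):
    """Stable rank ||H||_F^2 / ||H||_2^2 (assumed standard definition).

    H stacks the hidden post-activation representations, one row per sample.
    """
    fro_sq = np.sum(H ** 2)                 # squared Frobenius norm
    spec = np.linalg.norm(H, ord=2)         # largest singular value
    return fro_sq / spec ** 2
```

Sorting neurons by `neuron_contributions` (for example with `np.argsort`) before calling `cosine_alignment` gives a row ordering in the spirit of panels (b) through (d); grouping rows by the index of their largest cosine mimics the second reordering mentioned in the note.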

Theorems & Definitions (28)

  • Claim
  • Remark 1
  • Proposition 1: Generalization on clean data
  • Theorem 1: $l_2$-adversarial robustness
  • Remark 2
  • Conjecture 1
  • Lemma 1
  • Theorem 2: Alignment bias of positive neurons
  • Claim (restated)
  • Proof
  • ...and 18 more