Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

Binghui Li, Yuanzhi Li

TL;DR

This paper analyzes adversarial examples and adversarial training from the perspective of feature learning theory, and shows that adversarial training provably strengthens robust feature learning and suppresses non-robust feature learning, which improves network robustness.

Abstract

Adversarial training is a widely applied approach to training deep neural networks to be robust against adversarial perturbations. Although adversarial training has achieved empirical success, it remains unclear why adversarial examples exist and how adversarial training methods improve model robustness. In this paper, we provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory. Specifically, we focus on a multi-class classification setting in which the structured data is composed of two types of features: robust features, which are resistant to perturbation but sparse, and non-robust features, which are susceptible to perturbation but dense. We train a two-layer smoothed ReLU convolutional neural network to learn our structured data. First, we prove that under standard training (gradient descent over the empirical risk), the network learner primarily learns the non-robust features rather than the robust features, which leads to adversarial examples generated by perturbations aligned with negative non-robust feature directions. Then, we consider the gradient-based adversarial training algorithm, which runs gradient ascent to find adversarial examples and runs gradient descent over the empirical risk at those adversarial examples to update the model. We show that adversarial training provably strengthens robust feature learning and suppresses non-robust feature learning, which improves network robustness. Finally, we empirically validate our theoretical findings with experiments on real-image datasets, including MNIST, CIFAR10 and SVHN.
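The two-step loop described in the abstract (gradient ascent to find adversarial examples, then gradient descent on the risk at those examples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's setup: it uses a linear softmax classifier in place of the two-layer smoothed ReLU CNN, signed gradient steps projected onto an $\ell_\infty$ ball of radius `eps`, and hypothetical hyperparameters (`eps`, `lr`, `ascent_lr`, `ascent_steps`, `epochs`).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss_and_grads(W, X, y):
    """Mean cross-entropy of logits X @ W, plus gradients w.r.t. W and X."""
    n = X.shape[0]
    p = softmax(X @ W)
    loss = -np.log(p[np.arange(n), y] + 1e-12).mean()
    g = p.copy()
    g[np.arange(n), y] -= 1.0  # gradient of the summed loss w.r.t. the logits
    g /= n                     # rescale to the mean loss
    return loss, X.T @ g, g @ W.T

def find_adversarial(W, X, y, eps, ascent_lr, ascent_steps):
    """Inner step: signed gradient ascent on the loss, kept within eps of X."""
    X_adv = X.copy()
    for _ in range(ascent_steps):
        _, _, grad_x = ce_loss_and_grads(W, X_adv, y)
        X_adv = X_adv + ascent_lr * np.sign(grad_x)
        X_adv = X + np.clip(X_adv - X, -eps, eps)  # project onto the l_inf ball
    return X_adv

def adversarial_training(X, y, num_classes, eps=0.1, lr=0.5, epochs=200,
                         ascent_lr=0.05, ascent_steps=5):
    """Outer step: gradient descent on the risk evaluated at adversarial examples."""
    W = np.zeros((X.shape[1], num_classes))
    for _ in range(epochs):
        X_adv = find_adversarial(W, X, y, eps, ascent_lr, ascent_steps)
        _, grad_w, _ = ce_loss_and_grads(W, X_adv, y)
        W -= lr * grad_w
    return W
```

Setting `ascent_steps=0` makes `find_adversarial` return the clean inputs, so the same loop reduces to standard training (gradient descent over the empirical risk), which is a convenient way to compare the two regimes in this toy setting.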

Paper Structure

This paper contains 51 sections, 49 theorems, 64 equations, 9 figures, 2 tables.

Key Result

Proposition 3.1

We consider the special case when $m=1$ and $\boldsymbol{w}_{i,1} = \gamma \boldsymbol{v}_i$, where $\gamma > 0$ is a scale coefficient. Then, it holds that the standard empirical risk satisfies $\lim_{\gamma\rightarrow\infty}\mathcal{L}_{\mathrm{CE}}(\boldsymbol{F}) = o(1)$,

Figures (9)

  • Figure 1: An overview of our paper: the framework based on the robust/non-robust feature decomposition, and key messages about standard/adversarial training. The robust/non-robust features of elephant and cat are generated in the same way as in Ilyas et al. (2019), from random noise to ImageNet instances.
  • Figure 2: Illustration of our patch data: Each patch in data point $(\boldsymbol{X}, y)$ has the form $\boldsymbol{x}_p = \alpha_p \boldsymbol{u} + \boldsymbol{\xi}_p$ (robust-feature patch) or $\boldsymbol{x}_p = \beta_p \boldsymbol{v} + \boldsymbol{\xi}_p$ (non-robust-feature patch), where $\boldsymbol{u}, \boldsymbol{v}$ are the corresponding features for class $y$. For non-robust-feature patches, the adversarial perturbation $\boldsymbol{\Delta}$ replaces the non-robust feature $\boldsymbol{v}$ with another non-robust feature $\boldsymbol{v}'$ (corresponding to another class $y'$). This yields an adversarial example $\tilde{\boldsymbol{X}}$ that is classified with the incorrect label $y'$ when the network learner trained by standard training mainly learns the non-robust features $\boldsymbol{v},\boldsymbol{v}'$ rather than the robust features $\boldsymbol{u}, \boldsymbol{u}'$. We construct robust-feature/non-robust-feature data $\boldsymbol{X}_{\boldsymbol{u}}/\boldsymbol{X}_{\boldsymbol{v}}$ by replacing $\boldsymbol{v}/\boldsymbol{u}$ with the all-zero vector $\boldsymbol{0}$.
  • Figure 3: Simulations on synthetic data. The two left figures: dynamics of normalized weight-feature correlations for std/adv training. The two right figures: learning curves for std/adv training.
  • Figure 4: MNIST
  • Figure 5: CIFAR10
  • ...and 4 more figures
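
The patch structure described in the Figure 2 caption can be illustrated with a small sampler. This is a hedged sketch, not the paper's Definition 2.2: the orthonormal features, the number of patches, the split between robust and non-robust patches, the constant coefficients `alpha` and `beta`, and the Gaussian noise scale `sigma` are all assumptions chosen for illustration. The `keep` argument mimics the construction of $\boldsymbol{X}_{\boldsymbol{u}}$ and $\boldsymbol{X}_{\boldsymbol{v}}$ by zeroing out one feature type.

```python
import numpy as np

def make_features(num_classes, dim, rng):
    """Return (U, V): row i of U is u_i, row i of V is v_i, all orthonormal (assumption)."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, 2 * num_classes)))
    return q[:, :num_classes].T, q[:, num_classes:].T

def sample_patch_data(y, U, V, num_patches=8, num_robust=1, alpha=1.0,
                      beta=0.2, sigma=0.01, rng=None, keep="both"):
    """Sample one data point X of shape (num_patches, dim) for class y.

    The first num_robust patches are alpha * u + noise (sparse robust feature);
    the remaining patches are beta * v + noise (dense non-robust feature).
    keep="robust" zeroes v (giving X_u); keep="non_robust" zeroes u (giving X_v).
    """
    if rng is None:
        rng = np.random.default_rng()
    dim = U.shape[1]
    u = U[y] if keep in ("both", "robust") else np.zeros(dim)
    v = V[y] if keep in ("both", "non_robust") else np.zeros(dim)
    X = sigma * rng.normal(size=(num_patches, dim))  # noise term xi_p
    X[:num_robust] += alpha * u
    X[num_robust:] += beta * v
    return X
```

For example, `sample_patch_data(1, U, V, keep="robust")` returns a data point whose non-robust patches contain only noise, which is the kind of input one could use to probe how much a trained model relies on robust features.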

Theorems & Definitions (89)

  • Definition 2.1: Robust and Non-robust Features
  • Definition 2.2: Patch Data Distribution
  • Remark 2.4
  • Remark 2.5
  • Proposition 3.1: The Existence of Non-robust Global Minima
  • Proposition 3.2: The Existence of Robust Global Minima
  • Definition 4.1: Feature Learning Accuracy
  • Remark 4.2
  • Theorem 4.3: Standard Training Converges to Non-robust Global Minima
  • Theorem 4.4: Adversarial Training Converges to Robust Global Minima
  • ...and 79 more