Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective

Yue Xing, Xiaofeng Lin, Qifan Song, Yi Xu, Belinda Zeng, Guang Cheng

TL;DR

The paper asks why adversarially pre-trained models can confer robustness on downstream tasks. It develops a theoretical framework for feature purification in two-layer networks, showing that adversarial training pushes each hidden unit to specialize in a small subset of features, so that clean downstream training inherits robustness. It extends the purification concept to contrastive pre-training, analyzes downstream robustness, and validates the findings with real-data experiments that demonstrate robustness inheritance and feature purification. The work offers guidance for designing robust pre-trained representations and suggests that purification, not just training, underpins robust downstream performance.

Abstract

Pre-training is known to generate universal representations for downstream tasks in large-scale deep learning such as large language models. Existing literature, e.g., \cite{kim2020adversarial}, empirically observes that downstream tasks can inherit the adversarial robustness of the pre-trained model. We provide theoretical justifications for this robustness inheritance phenomenon. Our theoretical results reveal that feature purification plays an important role in connecting the adversarial robustness of the pre-trained model and the downstream tasks in two-layer neural networks. Specifically, we show that (i) with adversarial training, each hidden node tends to pick only one (or a few) features; (ii) without adversarial training, the hidden nodes can be vulnerable to attacks. This observation is valid for both supervised pre-training and contrastive learning. With purified nodes, it turns out that clean training is enough to achieve adversarial robustness in downstream tasks.
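
The contrast between purified and unpurified hidden nodes can be illustrated with a toy sketch. This is not the paper's construction: it assumes an identity dictionary (so the observed input equals the sparse feature vector), all-ones coefficients `theta`, hand-picked weights, and a first-order sign attack on the network output. All names (`forward`, `input_gradient`, `sign_attack`, `W_pure`, `W_dense`) are hypothetical.

```python
import numpy as np

def forward(W, a, b, z):
    """Two-layer ReLU network: f(z) = sum_h a_h * relu(w_h . z + b_h)."""
    return float(a @ np.maximum(W @ z + b, 0.0))

def input_gradient(W, a, b, z):
    """Gradient of f with respect to the input z; only active nodes contribute."""
    active = (W @ z + b) > 0
    return W.T @ (a * active)

def sign_attack(W, a, b, z, eps):
    """First-order l_inf perturbation of size eps in the direction that increases f."""
    return eps * np.sign(input_gradient(W, a, b, z))

# Assumed toy setup: d features, two of them active, identity dictionary.
d, eps = 16, 0.1
theta = np.ones(d)
x = np.zeros(d)
x[[0, 3]] = 1.0

# Purified network (assumption): one node per feature, negative bias
# so that inactive features keep their nodes switched off.
W_pure, a_pure, b_pure = np.eye(d), 2.0 * theta, -0.5 * np.ones(d)

# Unpurified network (assumption): a single node mixing all features.
W_dense, a_dense, b_dense = theta[None, :], np.array([1.0]), np.array([0.0])

delta_pure = sign_attack(W_pure, a_pure, b_pure, x, eps)
delta_dense = sign_attack(W_dense, a_dense, b_dense, x, eps)

change_pure = (forward(W_pure, a_pure, b_pure, x + delta_pure)
               - forward(W_pure, a_pure, b_pure, x))
change_dense = (forward(W_dense, a_dense, b_dense, x + delta_dense)
                - forward(W_dense, a_dense, b_dense, x))

print("attacked coordinates (purified):", np.count_nonzero(delta_pure))
print("attacked coordinates (dense):   ", np.count_nonzero(delta_dense))
print("output change (purified):", change_pure)
print("output change (dense):   ", change_dense)
```

Both toy networks produce the same clean output on `x`, but the sign attack perturbs only the two active coordinates of the purified network, while it perturbs all 16 coordinates of the dense one and moves its output about four times as far. The negative bias is what keeps the purified nodes for inactive features switched off under small perturbations.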

Paper Structure

This paper contains 57 sections, 12 theorems, 105 equations, 11 figures, 7 tables.

Key Result

Lemma 4.2

Assume $\epsilon = O(1/(\log(d)\sqrt{m^* k}))$, and $(W,b)\in\mathcal{M}$. Denote by $\mathcal{X}$ the set of coordinates $i$ where $|X_i|>0$. Assume $Ua=\theta$, $\|\theta\|_{\infty}=\Theta(1)$. With probability tending to 1 over the randomness of $\xi$ and $X$, where "$o$" represents a negligible term caused by the curvature of the loss. In probability, the $\theta_\mathcal{X}$ is the vector of

Figures (11)

  • Figure 1: A proof-of-concept example of the Sparse Coding Model. Categorical features can be reshaped into a sparse feature vector.
  • Figure 2: With purified hidden nodes, only the active features will be attacked, and the resulting adversarial loss is small. With unpurified hidden nodes, inactive features will also be impacted. Note that we transform the attack on the observable $Z$ back to its features $X$, to compare with $\theta_0$.
  • Figure 3: The adversary attacks dissimilar pairs but has little effect on similar pairs.
  • Figure 4: Left: Clean/adversarial contrastive testing loss under different levels of purification of the hidden nodes, for similar data pairs (i.e., $Y=1$) and dissimilar data pairs (i.e., $Y=-1$). Note that the blue and yellow curves overlap. Right: How $\alpha$ is related to $m$. The values of $\gamma_1$ and $\gamma_2$ are assumed to be in $\Theta(\alpha)$ and $\Theta(\alpha^2)$ respectively in Theorem \ref{lem:robust_similar}.
  • Figure 5: Learned features in the input convolutional layer trained on CIFAR-10.
  • ...and 6 more figures

Theorems & Definitions (25)

  • Definition 4.1
  • Lemma 4.2
  • Theorem 4.3: Informal Statement
  • Lemma 5.1: Basic Properties of Contrastive Learning
  • Theorem 5.2
  • Proposition 6.1
  • Theorem A.1
  • Theorem A.2
  • Lemma F.1
  • Proof of Lemma \ref{lem:1}
  • ...and 15 more