Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective
Yue Xing, Xiaofeng Lin, Qifan Song, Yi Xu, Belinda Zeng, Guang Cheng
TL;DR
The paper studies why adversarially pre-trained models can confer robustness to downstream tasks. It develops a theoretical framework for feature purification in two-layer networks, showing that adversarial training pushes each hidden unit to specialize in a small subset of features, so that clean downstream training inherits robustness. It extends the purification concept to contrastive pre-training, analyzes downstream robustness, and validates the findings with real-data experiments that demonstrate robustness inheritance and feature purification. The work offers guidance for designing robust pre-trained representations and suggests that feature purification, not adversarial training by itself, underpins robust downstream performance.
Abstract
Pre-training is known to generate universal representations for downstream tasks in large-scale deep learning such as large language models. Existing literature, e.g., \cite{kim2020adversarial}, empirically observes that downstream tasks can inherit the adversarial robustness of the pre-trained model. We provide theoretical justifications for this robustness inheritance phenomenon. Our theoretical results reveal that feature purification plays an important role in connecting the adversarial robustness of the pre-trained model and the downstream tasks in two-layer neural networks. Specifically, we show that (i) with adversarial training, each hidden node tends to pick only one (or a few) features; (ii) without adversarial training, the hidden nodes can be vulnerable to attacks. This observation is valid for both supervised pre-training and contrastive learning. With purified nodes, it turns out that clean training is enough to achieve adversarial robustness in downstream tasks.
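To make the contrast between adversarial pre-training and clean downstream training concrete, the following is a minimal sketch of the two objectives in their standard form. The notation ($W$, $a$, $b$, $\sigma$, $\ell$, $\ell'$, $\epsilon$, the $\ell_p$ attack norm) is assumed here for illustration and is not taken from the abstract, and freezing the hidden layer downstream is one common setup rather than necessarily the exact one analyzed in the paper.

```latex
% Assumed notation (illustrative only): two-layer network with hidden
% weights W, output weights a, and activation \sigma.
f_{W,a}(x) = a^\top \sigma(W x)

% Adversarial pre-training: minimize the worst-case loss over perturbations
% \delta of size at most \epsilon in an assumed \ell_p norm.
\min_{W,a} \; \mathbb{E}_{(x,y)} \Big[ \max_{\|\delta\|_p \le \epsilon}
    \ell\big(f_{W,a}(x+\delta),\, y\big) \Big]

% Clean downstream training (assumed setup): keep the pre-trained hidden
% layer \hat{W} fixed and fit a new output layer b with no attack in the loss.
\min_{b} \; \mathbb{E}_{(x',y')} \Big[ \ell'\big(b^\top \sigma(\hat{W} x'),\, y'\big) \Big]
```

In this notation, the abstract's claim reads as follows: when the rows of $\hat{W}$ are purified, so that each hidden node responds to only one or a few features, solving the clean downstream problem is enough to obtain an adversarially robust predictor.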
