Spectral regularization for adversarially-robust representation learning
Sheng Yang, Jacob A. Zavatone-Veth, Cengiz Pehlevan
TL;DR
This work addresses adversarial robustness in representation-learning settings, where downstream inference relies on fixed feature representations. It derives a theoretical bound linking robustness to the feature-map gradient and last-layer alignment, and introduces a regularizer, rep-spectral, that penalizes the top singular values of hidden-layer weight matrices to constrain gradient norms. Empirically, rep-spectral increases adversarial distances and often improves classification performance across supervised, self-supervised (SSL), and transfer-learning tasks, sometimes outperforming the earlier ll-spectral regularization. The findings suggest that shaping the representation space, not just the final readout, is crucial for robustness, with practical benefits for SSL pretraining and cross-task transfer, although the gains in SSL are limited and the results are sensitive to finetuning.
Abstract
The vulnerability of neural network classifiers to adversarial attacks is a major obstacle to their deployment in safety-critical applications. Regularization of network parameters during training can be used to improve adversarial robustness and generalization performance. Usually, the network is regularized end-to-end, with parameters at all layers affected by regularization. However, in settings where learning representations is key, such as self-supervised learning (SSL), layers after the feature representation will be discarded when performing inference. For these models, regularizing up to the feature space is more suitable. To this end, we propose a new spectral regularizer for representation learning that encourages black-box adversarial robustness in downstream classification tasks. In supervised classification settings, we show empirically that this method is more effective in boosting test accuracy and robustness than previously-proposed methods that regularize all layers of the network. We then show that this method improves the adversarial robustness of classifiers using representations learned with self-supervised training or transferred from another classification task. In all, our work begins to unveil how representational structure affects adversarial robustness.
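To make the idea of regularizing only up to the feature space concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a fully connected network given as a list of weight matrices whose last entry is the readout layer, estimates each hidden layer's top singular value by power iteration, and sums the squared estimates. The function names, the coefficient `coeff`, and the exact form of the penalty (half the sum of squared top singular values) are illustrative assumptions.

```python
import numpy as np


def top_singular_value(W, n_iter=100, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v) + 1e-12
    for _ in range(n_iter):
        u = W @ v            # map into the output space
        v = W.T @ u          # map back: one step of power iteration on W^T W
        v /= np.linalg.norm(v) + 1e-12
    return float(np.linalg.norm(W @ v))


def rep_spectral_penalty(weights, coeff=0.01):
    """Hypothetical penalty over hidden layers only.

    Assumption: `weights` is ordered from input to output, and the last
    matrix is the readout layer, which is deliberately left unregularized.
    """
    hidden = weights[:-1]
    return 0.5 * coeff * sum(top_singular_value(W) ** 2 for W in hidden)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy three-layer network: two hidden layers and one readout layer.
    weights = [rng.standard_normal((16, 8)),
               rng.standard_normal((16, 16)),
               rng.standard_normal((4, 16))]
    print("penalty on hidden layers:", rep_spectral_penalty(weights))
```

In a training loop, a term of this kind would be added to the task loss. Penalizing every matrix in `weights` instead of `weights[:-1]` would correspond to the end-to-end style of regularization that the abstract contrasts with.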
