Spectral regularization for adversarially-robust representation learning

Sheng Yang, Jacob A. Zavatone-Veth, Cengiz Pehlevan

TL;DR

This work addresses adversarial robustness in representation-learning contexts where downstream inferences rely on fixed feature representations. It derives a theoretical bound linking robustness to the feature-map gradient and last-layer alignment, and introduces a rep-spectral regularizer that penalizes the top singular values of hidden-layer weight matrices to constrain gradient norms. Empirically, rep-spectral improves adversarial distances and often classification performance across supervised, self-supervised, and transfer-learning tasks, sometimes outperforming prior ll-spectral regularization. The findings suggest that shaping the representation space, not just the final readout, is crucial for robustness, with practical benefits for SSL pretraining and cross-task transfer, albeit with some limitations in SSL gains and finetuning sensitivity.
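As a rough illustration of what penalizing top singular values can look like, the sketch below estimates the largest singular value of each weight matrix by power iteration and sums the squares. The function names, the squared-sum form of the penalty, and the plain-list matrix representation are all assumptions made for this example; the summary above does not specify the exact functional form used in the paper. The only distinction taken from the source is that rep-spectral covers the hidden layers, while ll-spectral also includes the last layer.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of W as a list of rows."""
    return [list(col) for col in zip(*W)]


def top_singular_value(W, iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = transpose(W)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))           # one application of W^T W
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:                         # v lies in the null space of W
            return 0.0
        v = [x / norm for x in u]
    Wv = matvec(W, v)
    return math.sqrt(sum(x * x for x in Wv))    # ||W v|| with ||v|| = 1


def spectral_penalty(weights, include_last_layer):
    """Sum of squared top singular values (assumed form of the penalty).

    include_last_layer=False: hidden layers only (rep-spectral style).
    include_last_layer=True:  all layers (ll-spectral style).
    """
    layers = weights if include_last_layer else weights[:-1]
    return sum(top_singular_value(W) ** 2 for W in layers)
```

In training, a penalty of this kind would be scaled by a coefficient and added to the task loss; that coefficient and the schedule for updating the singular-vector estimates are design choices not covered here.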

Abstract

The vulnerability of neural network classifiers to adversarial attacks is a major obstacle to their deployment in safety-critical applications. Regularization of network parameters during training can be used to improve adversarial robustness and generalization performance. Usually, the network is regularized end-to-end, with parameters at all layers affected by regularization. However, in settings where learning representations is key, such as self-supervised learning (SSL), layers after the feature representation will be discarded when performing inference. For these models, regularizing up to the feature space is more suitable. To this end, we propose a new spectral regularizer for representation learning that encourages black-box adversarial robustness in downstream classification tasks. In supervised classification settings, we show empirically that this method is more effective in boosting test accuracy and robustness than previously-proposed methods that regularize all layers of the network. We then show that this method improves the adversarial robustness of classifiers using representations learned with self-supervised training or transferred from another classification task. In all, our work begins to unveil how representational structure affects adversarial robustness.

Paper Structure (29 sections, 3 theorems, 19 equations, 20 figures, 1 algorithm)

Key Result

Lemma 1

Suppose a neural network classifier $F: \mathbb{R}^n \rightarrow \mathbb{R}^K$ is continuously differentiable and can be decomposed into a feature map $\Phi$ followed by a last layer $f$ (eq:decomp), where $f(z) = W^{(L)}z$ is the last-layer linear transformation without bias and $W^{(L)}_c$ is the $c$-th row of $W^{(L)}$, treated as a row vector. Suppose $x$ [...] where $\theta_{x} = \frac{(W^{(L)}_c - W^{(L)}_k ) \Phi(x)}{\Vert W^{(L)}_c - W^{(L)}_k\Vert_2 \Vert \cdots \Vert}$ [...]
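The lemma statement is cut off, so only part of $\theta_x$ is recoverable. The sketch below computes just that visible part: the projection of the features $\Phi(x)$ onto the normalized difference of two last-layer rows, which equals the signed distance from $\Phi(x)$ to the readout's decision hyperplane between classes $c$ and $k$. The function name and list-based inputs are hypothetical, and the remaining factor in the denominator is deliberately not reproduced because it is missing from the text.

```python
import math


def normalized_readout_margin(W_last, phi_x, c, k):
    """Compute (W_c - W_k) . phi_x / ||W_c - W_k||_2.

    W_last: last-layer weight matrix as a list of rows (one row per class).
    phi_x:  feature vector Phi(x) as a list of floats.
    c, k:   indices of the two classes being compared.
    This is only the visible part of theta_x from the truncated lemma.
    """
    diff = [wc - wk for wc, wk in zip(W_last[c], W_last[k])]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("rows c and k of the last layer are identical")
    return sum(d * p for d, p in zip(diff, phi_x)) / norm
```

A positive value means the readout scores class $c$ above class $k$ for this input, and a larger value means the features lie farther from that pairwise boundary.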

Figures (20)

  • Figure 1: Tradeoff between prediction accuracy and adversarial robustness in a simple binary classification task. The first scenario has 50% accuracy, but the adversarial distances for the correct samples are large. The second scenario correctly classifies every sample, but not every sample has a large adversarial distance. The last scenario is the ideal: a large-margin solution.
  • Figure 2: Average $\Delta_x$ found by TA across 10 different seeds for models trained on XOR with (right) and without (left) weight decay, and with the inclusion of our proposed rep-spectral regularizer, the ll-spectral regularizer that includes all layers, or no additional spectral regularizer. Error bars show $\pm 1$ standard deviation over seeds.
  • Figure 3: Effect of spectral regularization on the representations of MLPs trained on the toy XOR task. (a) Direct visualization of the decision boundaries of models trained using no regularization (left), ll-spectral regularization (middle), and our proposed rep-spectral regularization (right). The four training points are shown, colored according to their class. (b) Visualization of the volume element (see \ref{rmk:metric}), which measures the sensitivity of the representation to small variations in the input, for models trained with each of these three methods. For details, see \ref{sup:shallow}.
  • Figure 4: Spectral regularization during supervised training of a single-hidden-layer MLP on MNIST images improves robustness. For each regularization method, we show test accuracy (left) and adversarial distance averaged across 1000 samples (right) across 10 random seeds. Gray dots indicate the results for individual seeds, while boxplots show the mean and quartiles.
  • Figure 5: Re-training a readout from the hidden representation of an MLP pretrained on MNIST. For each regularization method, we show test accuracy (left) and adversarial distance averaged across 1000 samples (right) across 10 random seeds. Gray dots indicate the results for individual seeds, while boxplots show the mean and quartiles.
  • ...and 15 more figures

Theorems & Definitions (9)

  • Lemma 1
  • proof
  • Remark 1
  • Remark 2
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof