Adversarially Pretrained Transformers May Be Universally Robust In-Context Learners

Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki

TL;DR

The paper investigates whether adversarially pretrained transformers can serve as universally robust foundation models capable of rapid, robust adaptation to unseen tasks via in-context learning. It develops a theoretical framework around single-layer linear transformers trained with adversarial pretraining across multiple datasets, distinguishing robust versus non-robust features and analyzing how in-context demonstrations guide robust adaptation under norm-bounded perturbations. The main contributions include a formal problem setup, a complete characterization of the global optima under different adversarial regimes, and a demonstration that adversarial pretraining can yield universal robustness across seen and unseen tasks under mild conditions, along with explicit trade-offs and open challenges. Empirical results on synthetic settings and standard benchmarks corroborate the theory, showing robustness gains at the cost of lower clean accuracy and illustrating the practical implications for designing universally robust in-context learners. Overall, the work lays foundational theory for universally robust foundation models and discusses practical considerations such as computational cost and sample complexity for real-world deployment.

Abstract

Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models -- models that can robustly adapt to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can robustly generalize to unseen classification tasks through in-context learning from clean demonstrations (i.e., without requiring additional adversarial training or examples). This universal robustness stems from the model's ability to adaptively focus on robust features within given tasks. We also show the two open challenges for attaining robustness: accuracy--robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models. While their training is expensive, the investment would prove worthwhile as downstream tasks can enjoy free adversarial robustness. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.

Paper Structure

This paper contains 37 sections, 17 theorems, 122 equations, 6 figures, 1 table.

Key Result

Lemma 3.2

The minimization problem \ref{optim-problem} can be transformed into the maximization problem $\max_{\bm{b} \in \{0, 1\}^{d+1}} \sum^{d(d+1)}_{i=1} \max (0, \sum^{d+1}_{j=1} b_j h_{i,j})$, where $h_{i,j} \in \mathbb{R}$ is an $(i, j)$-dependent constant, and there exists a mapping from $\bm{b}$ to $\bm{P}$ ...
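
To make the shape of this objective concrete, the following is a minimal sketch that evaluates $\sum_i \max(0, \sum_j b_j h_{i,j})$ and maximizes it over binary vectors by exhaustive search. It is an illustration only, not the authors' method: the names `objective` and `brute_force_max` are hypothetical, and the entries of `h` are random placeholders rather than the constants derived in the paper. Exhaustive search visits all $2^{d+1}$ candidates, so it is practical only for very small $d$.

```python
import itertools
import random


def objective(b, h):
    """Sum over rows i of max(0, sum_j b[j] * h[i][j])."""
    return sum(
        max(0.0, sum(b_j * h_ij for b_j, h_ij in zip(b, row)))
        for row in h
    )


def brute_force_max(h):
    """Try every binary vector b of length len(h[0]); return the best one."""
    n_cols = len(h[0])  # plays the role of d + 1
    best_b, best_val = None, float("-inf")
    for b in itertools.product((0, 1), repeat=n_cols):
        val = objective(b, h)
        if val > best_val:
            best_b, best_val = b, val
    return best_b, best_val


# Placeholder constants (assumption): d(d+1) rows and d+1 columns, as in the lemma.
d = 3
rng = random.Random(0)
h = [[rng.uniform(-1.0, 1.0) for _ in range(d + 1)] for _ in range(d * (d + 1))]

best_b, best_val = brute_force_max(h)
print("best b:", best_b, "objective:", best_val)
```

Because the all-zero vector always yields an objective of 0, the maximum is never negative, whatever the constants are.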

Figures (6)

  • Figure 1: Parameter heatmaps induced by adversarial training \ref{optim-problem} with $d=20$ and $\lambda = 0.1$. For the standard, adversarial, and strong adversarial regimes, we used $\epsilon = 0$, $\frac{1 + (d - 1) (\lambda / 2)}{d} = 0.098$, and $\frac{\lambda}{2} + \frac{3}{2} \frac{ 2 - \lambda }{ (d - 1) \lambda^2 + 3 } = 0.95$, respectively. We optimized \ref{optim-problem} by stochastic gradient descent. Detailed experimental settings can be found in \ref{sec:additional-exp}.
  • Figure A2: Statistical properties of preprocessed MNIST, Fashion-MNIST, and CIFAR-10 datasets. First row: Blue lines represent the mean of $(1/N) \sum^N_{n=1} y_n \bm{x}_n$ across 45 binary class pairs and shaded regions represent the sample standard deviation. Orange lines represent typical perturbation magnitude. Green dashed lines represent the (pseudo) threshold between robust and non-robust dimensions. Second row: Blue lines represent the total covariance of each dimension with other dimensions and shaded regions represent sample standard deviation across the 45 binary class pairs. Green dashed lines represent the boundary between positive and negative total covariance.
  • Figure A3: Parameter heatmaps induced by adversarial training \ref{optim-problem} with $d=100$ and $\lambda = 0.1$. For the standard, adversarial, and strong adversarial regimes, we used $\epsilon = 0$, $\frac{1 + (d - 1) (\lambda / 2)}{d} = 0.06$, and $\frac{\lambda}{2} + \frac{3}{2} \frac{ 2 - \lambda }{ (d - 1) \lambda^2 + 3 } = 0.77$, respectively. We optimized \ref{optim-problem} by stochastic gradient descent.
  • Figure A4: Accuracy (%) of normally and adversarially pretrained single-layer transformers. Lines represent mean accuracy across batches and shaded regions represent unbiased standard deviation (notably small in magnitude). We used 1,000 batches, each containing 1,000 in-context demonstrations ($N = 1000$) and 1,000 query examples. Base configuration parameters were $d = 100$, $\lambda = 0.1$, and $\epsilon = 0.15$.
  • Figure A5: Accuracy (%) of normally and adversarially pretrained single-layer transformers. Lines represent mean accuracy across batches and shaded regions represent unbiased standard deviation. We used 1,000 batches, each containing 1,000 in-context demonstrations ($N = 1000$) and 1,000 query examples. Base configuration parameters were $d_\mathrm{rob} = 10$, $d_\mathrm{vul} = 90$, $d_\mathrm{irr} = 0$, $\alpha = 1.0$, $\beta = 0.1$, $\gamma = 0.1$, and $\epsilon = 0.2$.
  • ...and 1 more figure
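
The two perturbation budgets quoted in the captions of Figures 1 and A3 are closed-form expressions in $d$ and $\lambda$, so they can be checked with a few lines of arithmetic. The sketch below evaluates both expressions for the two reported settings. The function names `eps_adversarial` and `eps_strong` are hypothetical labels chosen here, not names from the paper or its code. With $\lambda = 0.1$ the expressions evaluate to roughly 0.0975 and 0.943 for $d = 20$, and 0.0595 and 0.764 for $d = 100$, close to the caption values, which appear to be rounded up to two significant figures.

```python
def eps_adversarial(d, lam):
    """(1 + (d - 1) * (lam / 2)) / d, the budget quoted for the adversarial regime."""
    return (1 + (d - 1) * (lam / 2)) / d


def eps_strong(d, lam):
    """lam / 2 + (3 / 2) * (2 - lam) / ((d - 1) * lam**2 + 3), the strong regime."""
    return lam / 2 + 1.5 * (2 - lam) / ((d - 1) * lam**2 + 3)


# Settings taken from the captions: lambda = 0.1 with d = 20 (Figure 1)
# and d = 100 (Figure A3).
for d in (20, 100):
    print(
        f"d={d}: adversarial eps={eps_adversarial(d, 0.1):.4f}, "
        f"strong eps={eps_strong(d, 0.1):.4f}"
    )
```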

Theorems & Definitions (28)

  • Lemma 3.2: Transformation of original optimization problem
  • Theorem 3.3: Parameters induced by adversarial pretraining
  • Theorem 3.4: Standard pretraining case
  • Theorem 3.5: Adversarial pretraining case
  • Theorem 3.6: Accuracy--robustness trade-off
  • Lemma D.0: Transformation of original optimization problem
  • proof
  • Theorem D.1: Parameters induced by adversarial pretraining
  • proof
  • Theorem D.1: General case of \ref{th:pretraining}
  • ...and 18 more