Adversarially Pretrained Transformers May Be Universally Robust In-Context Learners
Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
TL;DR
The paper investigates whether adversarially pretrained transformers can serve as universally robust foundation models capable of rapid, robust adaptation to unseen tasks via in-context learning. It develops a theoretical framework around single-layer linear transformers trained with adversarial pretraining across multiple datasets, distinguishing robust versus non-robust features and analyzing how in-context demonstrations guide robust adaptation under norm-bounded perturbations. The main contributions include a formal problem setup, a complete characterization of the global optima under different adversarial regimes, and a demonstration that adversarial pretraining can yield universal robustness across seen and unseen tasks under mild conditions, along with explicit trade-offs and open challenges. Empirical results on synthetic settings and standard benchmarks corroborate the theory, showing robustness gains at the cost of lower clean accuracy and illustrating the practical implications for designing universally robust in-context learners. Overall, the work lays foundational theory for universally robust foundation models and discusses practical considerations such as computational cost and sample complexity for real-world deployment.
Abstract
Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models, that is, models that can robustly adapt to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can robustly generalize to unseen classification tasks through in-context learning from clean demonstrations (i.e., without requiring additional adversarial training or examples). This universal robustness stems from the model's ability to adaptively focus on robust features within given tasks. We also identify two open challenges for attaining robustness: the accuracy-robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models. While their training is expensive, the investment would prove worthwhile as downstream tasks can enjoy free adversarial robustness. The code is available at https://github.com/s-kumano/universally-robust-in-context-learner.
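To make the setting in the abstract concrete, the following is a minimal sketch of in-context classification with a simplified single-layer linear attention model, evaluated under a norm-bounded perturbation of the query. It is not the authors' implementation (see their repository for that). The score form `x_q^T W c`, the identity placeholder for `W` (standing in for pretrained parameters), the choice of an L-infinity budget `eps`, and the toy data with one strong and three weak features are all assumptions made for illustration. Because the score is linear in the query, the worst-case margin has a closed form via the dual norm: the clean margin minus `eps` times the L1 norm of `W c`.

```python
import numpy as np


def context_vector(demos_x, demos_y):
    """Label-weighted mean of the clean demonstrations: (1/N) * sum_i y_i * x_i."""
    return (demos_y[:, None] * demos_x).mean(axis=0)


def predict(W, demos_x, demos_y, x_query):
    """Simplified linear-attention score x_q^T W c; its sign is the predicted label."""
    return float(x_query @ W @ context_vector(demos_x, demos_y))


def worst_case_delta(W, demos_x, demos_y, y_query, eps):
    """Query perturbation with ||delta||_inf <= eps that minimizes the margin."""
    v = W @ context_vector(demos_x, demos_y)
    return -eps * y_query * np.sign(v)


def worst_case_margin(W, demos_x, demos_y, x_query, y_query, eps):
    """Closed-form worst-case margin: clean margin minus eps * ||W c||_1."""
    v = W @ context_vector(demos_x, demos_y)
    return float(y_query * (x_query @ v) - eps * np.abs(v).sum())


# Hypothetical toy task: feature 0 is strongly label-correlated ("robust"),
# features 1-3 are only weakly correlated ("non-robust").
rng = np.random.default_rng(0)
d, n_demos, eps = 4, 32, 0.1
mean_dir = np.array([1.0, 0.05, 0.05, 0.05])

demos_y = rng.choice([-1.0, 1.0], size=n_demos)
demos_x = demos_y[:, None] * mean_dir + 0.1 * rng.standard_normal((n_demos, d))

y_query = 1.0
x_query = y_query * mean_dir + 0.1 * rng.standard_normal(d)

W = np.eye(d)  # placeholder, NOT adversarially pretrained parameters

clean_margin = y_query * predict(W, demos_x, demos_y, x_query)
delta = worst_case_delta(W, demos_x, demos_y, y_query, eps)
attacked_margin = y_query * predict(W, demos_x, demos_y, x_query + delta)
robust_margin = worst_case_margin(W, demos_x, demos_y, x_query, y_query, eps)

print(f"clean margin:    {clean_margin:.4f}")
print(f"attacked margin: {attacked_margin:.4f}")
print(f"closed form:     {robust_margin:.4f}")
```

A positive worst-case margin means the prediction on this query cannot be flipped within the budget. In this sketch only the query is perturbed and the demonstrations stay clean, mirroring the "clean demonstrations" setting described above. Under the stated assumption, a `W` that down-weights the weakly correlated features would shrink the `eps * ||W c||_1` penalty, which gives one intuition for why focusing on robust features helps.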
