Provable Adversarial Robustness in In-Context Learning

Di Zhang

TL;DR

This work develops a distributionally robust framework for in-context learning (ICL) under adversarial distribution shifts using Wasserstein DRO. By analyzing a tractable linear self-attention Transformer, it derives a non-asymptotic upper bound on the worst-case meta-risk that decomposes into a nominal risk plus perturbation terms scaling with the shift radius $\rho$, model capacity $m$, and in-context size $N$. The key findings show that the robust radius scales as $\rho_{\max} \propto \sqrt{m}$ and the adversarial sample complexity grows as $N_\rho - N_0 \propto \rho^2$, with empirical validation on synthetic tasks confirming these scaling laws. The results formalize the sense in which model capacity is a fundamental resource for distributional robustness in ICL and offer guidance for robustness-aware design and evaluation, including the limits of post-hoc alignment and practical deployment strategies.

Abstract

Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($\rho$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($\rho_{\max} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_\rho - N_0 \propto \rho^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.

Paper Structure (47 sections, 7 theorems, 24 equations, 3 figures, 1 table, 1 algorithm)

Key Result

Lemma 4.1

Consider a single-layer, multi-head linear self-attention model $f_\theta$, whose parameters $\theta$ encode the key, query, and value projection matrices. Suppose this model is pretrained on linear regression tasks as defined in the problem formulation section and converges to a global optimum ... Furthermore, when the pretraining distribution is $\mathbb{P} = \mathcal{N}(0, \sigma_\beta^2 I_d)$ ...
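
The lemma statement is cut off in this listing, so the exact form of the equivalence is not reproduced here. As a rough illustration of what a ridge-regression in-context predictor looks like, the sketch below fits a ridge estimator to $N$ context pairs and applies it to a query. The function name `ridge_icl_predict`, the ridge strength `lam`, the noise level, and the unit-variance task prior are all assumptions made for this example, not values taken from the paper.

```python
import numpy as np

def ridge_icl_predict(X, y, x_query, lam):
    """Ridge-regression prediction from N in-context examples.

    X: (N, d) context inputs, y: (N,) context labels,
    x_query: (d,) query input, lam: ridge strength (hypothetical parameter).
    """
    d = X.shape[1]
    # Solve (X^T X + lam * I) w = X^T y rather than forming an explicit inverse.
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(x_query @ w_hat)

rng = np.random.default_rng(0)
d, N = 5, 200
beta = rng.normal(size=d)                  # task vector drawn from N(0, I_d) (assumed sigma_beta = 1)
X = rng.normal(size=(N, d))                # context inputs
y = X @ beta + 0.1 * rng.normal(size=N)    # noisy linear labels (assumed noise level)
x_query = rng.normal(size=d)

pred = ridge_icl_predict(X, y, x_query, lam=1.0)
error = abs(pred - float(x_query @ beta))
print(error)  # expected to be small when N is large relative to d
```

Using `np.linalg.solve` avoids an explicit matrix inverse, which is the usual numerically safer choice for a system of this shape.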

Figures (3)

  • Figure 1: Conceptual illustration of an adversarial distribution shift within the Wasserstein ball $\mathcal{B}_\rho(\mathbb{Q}_0)$. The nominal distribution $\mathbb{Q}_0$ (blue) and an adversarial distribution $\mathbb{Q}$ (red) are shown. The black arrow represents the Wasserstein distance $\rho$ between them. The background schematic connects task points to a Transformer via attention lines.
  • Figure 2: Schematic visualization of the worst-case meta-risk upper bound as a function of adversarial radius $\rho$ and model capacity $m$. The surface illustrates how risk increases quadratically with $\rho$ but becomes progressively flatter as $m$ increases, demonstrating the mitigating effect of model capacity on adversarial vulnerability.
  • Figure 3: Empirical validation of theoretical predictions.
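
To make the Wasserstein ball of Figure 1 concrete, the following sketch checks whether a shifted task distribution lies within radius $\rho$ of a nominal one. It assumes the 2-Wasserstein distance with Euclidean ground metric and restricts both distributions to isotropic Gaussians, for which the distance has a closed form; this listing does not state which Wasserstein order the paper uses, and the function names are hypothetical.

```python
import math
import numpy as np

def w2_isotropic_gaussians(mu0, sigma0, mu1, sigma1):
    """2-Wasserstein distance between N(mu0, sigma0^2 I) and N(mu1, sigma1^2 I).

    For isotropic Gaussians in d dimensions:
    W2^2 = ||mu0 - mu1||^2 + d * (sigma0 - sigma1)^2.
    """
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    d = mu0.size
    return math.sqrt(float(np.sum((mu0 - mu1) ** 2)) + d * (sigma0 - sigma1) ** 2)

def in_wasserstein_ball(mu0, sigma0, mu1, sigma1, rho):
    """True if the shifted Gaussian lies within W2 radius rho of the nominal one."""
    return w2_isotropic_gaussians(mu0, sigma0, mu1, sigma1) <= rho

d = 4
nominal_mean = np.zeros(d)
shifted_mean = np.full(d, 0.5)   # mean shift with Euclidean norm 1.0

dist = w2_isotropic_gaussians(nominal_mean, 1.0, shifted_mean, 1.0)
print(dist)                                                               # 1.0
print(in_wasserstein_ball(nominal_mean, 1.0, shifted_mean, 1.0, rho=1.5)) # True
```

A pure mean shift contributes its Euclidean norm to the distance, while a pure scale change contributes $\sqrt{d}\,|\sigma_0 - \sigma_1|$, so in this restricted family the same radius $\rho$ admits smaller per-coordinate scale changes as the dimension grows.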

Theorems & Definitions (12)

  • Lemma 4.1: Ridge Regression Equivalence of Linear Transformers
  • Definition 4.2: Lipschitz Continuity of the Predictor
  • Lemma 4.3: Gradient Bound and Lipschitz Constant
  • Theorem 4.4: Worst-case Meta-Risk Upper Bound
  • Corollary 4.5: Safe Radius and Model Capacity
  • Corollary 4.6: Sample Complexity for Adversarial ICL
  • Corollary 4.7: Dual Role of Regularization Strength $\lambda_N$
  • proof
  • proof
  • proof
  • ...and 2 more
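
As a small arithmetic illustration of the two scaling laws stated in the abstract ($\rho_{\max} \propto \sqrt{m}$ and $N_\rho - N_0 \propto \rho^2$), the sketch below compares ratios, which do not depend on the unknown proportionality constants. The constants `c` and `k` and both function names are placeholders introduced for this example; the paper's actual constants are not given in this listing.

```python
import math

def robust_radius(m, c=1.0):
    """Safe radius under rho_max = c * sqrt(m); c is a hypothetical constant."""
    return c * math.sqrt(m)

def extra_context_examples(rho, k=1.0):
    """Sample-complexity penalty under N_rho - N_0 = k * rho^2; k is hypothetical."""
    return k * rho ** 2

# Quadrupling capacity doubles the tolerable shift radius.
radius_ratio = robust_radius(64) / robust_radius(16)
print(radius_ratio)   # 2.0

# Doubling the shift radius quadruples the extra in-context examples needed.
penalty_ratio = extra_context_examples(2.0) / extra_context_examples(1.0)
print(penalty_ratio)  # 4.0
```

Read together, the two ratios suggest that tolerating a shift twice as large costs either four times the capacity or four times the additional context, under the stated proportionalities.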