Provable Adversarial Robustness in In-Context Learning

Di Zhang

TL;DR

This work develops a distributionally robust framework for in-context learning (ICL) under adversarial distribution shifts using Wasserstein DRO. By analyzing a tractable linear self-attention Transformer, it derives a non-asymptotic upper bound on the worst-case meta-risk that decomposes into a nominal risk plus perturbation terms scaling with the shift radius $\rho$, model capacity $m$, and in-context size $N$. The key findings show that the robust radius scales as $\rho_{\max} \propto \sqrt{m}$ and the adversarial sample complexity grows as $N_\rho - N_0 \propto \rho^2$, with empirical validation on synthetic tasks confirming these scaling laws. The results formalize the sense in which model capacity is a fundamental resource for distributional robustness in ICL and offer guidance for robustness-aware design and evaluation, including the limits of post-hoc alignment and practical deployment strategies.

Abstract

Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($\rho$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($\rho_{\max} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_\rho - N_0 \propto \rho^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.
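
For orientation, the worst-case objective described in the abstract can be written in the generic Wasserstein DRO form below. This is only a sketch built from symbols that appear elsewhere on this page ($\mathcal{B}_\rho(\mathbb{Q}_0)$, $\mathcal{L}_{\mathbb{Q}_0}$). The order of the Wasserstein distance $W$, the per-task loss $\ell$, and the task variable $\tau$ are assumptions, since the page does not state them, and the exact constants of the paper's bound are not reproduced.

```latex
% Wasserstein ball of radius \rho around the nominal task distribution \mathbb{Q}_0
\mathcal{B}_\rho(\mathbb{Q}_0) = \{\, \mathbb{Q} : W(\mathbb{Q}, \mathbb{Q}_0) \le \rho \,\}

% Meta-risk under a task distribution \mathbb{Q} (\ell and \tau are assumed notation)
\mathcal{L}_{\mathbb{Q}}(\theta) = \mathbb{E}_{\tau \sim \mathbb{Q}} \big[ \ell(\theta; \tau) \big]

% Worst-case (distributionally robust) meta-risk
\sup_{\mathbb{Q} \in \mathcal{B}_\rho(\mathbb{Q}_0)} \mathcal{L}_{\mathbb{Q}}(\theta)

% Scaling laws as stated in the abstract
\rho_{\max} \propto \sqrt{m}, \qquad N_\rho - N_0 \propto \rho^2
```

Setting $\rho = 0$ collapses the ball to $\{\mathbb{Q}_0\}$, so the worst-case meta-risk reduces to the nominal risk $\mathcal{L}_{\mathbb{Q}_0}(\theta)$.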

Paper Structure (47 sections, 7 theorems, 24 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Related Work
Problem Formulation: Distributionally Robust ICL
Standard In-Context Learning Setup
Adversarial Shift and Wasserstein Uncertainty
Distributionally Robust Meta-Risk
Tractable Linear Transformer Model and Assumptions
Theoretical Analysis: Robustness Guarantees for Linear Transformers
Preliminaries and Lemmas
Main Theorem: Upper Bound on Worst-case Meta-Risk
Step 1: Risk Decomposition and Dual Representation.
Step 2: Lipschitzness and Variance Control of the Loss.
Step 3: Optimizing the Dual Variable and Combining Bounds.
Step 4: Characterization of Nominal Risk $\mathcal{L}_{\mathbb{Q}_0}(\theta^*)$.
Corollaries: Robustness-Capacity-Sample Size Trade-offs
...and 32 more sections

Key Result

Lemma 4.1

Consider a single-layer, multi-head linear self-attention model $f_\theta$, whose parameters $\theta$ encode the key, query, and value projection matrices. Suppose this model is pretrained on linear regression tasks as defined in the Problem Formulation section and converges to a global optimum $\theta^*$ [...] Furthermore, when the pretraining distribution is $\mathbb{P} = \mathcal{N}(0, \sigma_\beta^2 I_d)$ [...]
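
The lemma statement is truncated on this page, but its title in the theorem list ("Ridge Regression Equivalence of Linear Transformers") points at a ridge predictor computed from the in-context examples. The following is a minimal NumPy sketch of such a predictor, not the paper's construction. The helper names `sample_task` and `ridge_icl_predict`, the standard Gaussian inputs, the noise-free labels, and the value of `lam` are all illustrative assumptions.

```python
import numpy as np


def sample_task(rng, d, n, sigma_beta=1.0, noise_std=0.0):
    """Draw one linear regression task with beta ~ N(0, sigma_beta^2 I_d).

    Assumption: inputs are standard Gaussian; the page does not specify them.
    """
    beta = rng.normal(0.0, sigma_beta, size=d)
    X = rng.normal(size=(n, d))
    y = X @ beta + noise_std * rng.normal(size=n)
    return beta, X, y


def ridge_icl_predict(X, y, x_query, lam):
    """Ridge prediction for x_query from N in-context pairs (X: N x d, y: N)."""
    d = X.shape[1]
    # beta_hat = (X^T X + lam * I)^{-1} X^T y, solved without forming the inverse
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(x_query @ beta_hat), beta_hat


rng = np.random.default_rng(0)
d, n = 5, 50
beta, X, y = sample_task(rng, d, n)
x_query = rng.normal(size=d)

# With noise-free labels and a tiny regularizer, the ridge prediction
# should be very close to the true response x_query^T beta.
pred, beta_hat = ridge_icl_predict(X, y, x_query, lam=1e-8)
target = float(x_query @ beta)
print(abs(pred - target))
```

Increasing `lam` shrinks `beta_hat` toward zero, which is the knob that Corollary 4.7 in the theorem list refers to as the regularization strength $\lambda_N$.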

Figures (3)

Figure 1: Conceptual illustration of an adversarial distribution shift within the Wasserstein ball $\mathcal{B}_\rho(\mathbb{Q}_0)$. The nominal distribution $\mathbb{Q}_0$ (blue) and an adversarial distribution $\mathbb{Q}$ (red) are shown. The black arrow represents the Wasserstein distance $\rho$ between them. The background schematic connects task points to a Transformer via attention lines.
Figure 2: Schematic visualization of the worst-case meta-risk upper bound as a function of adversarial radius $\rho$ and model capacity $m$. The surface illustrates how risk increases quadratically with $\rho$ but becomes progressively flatter as $m$ increases, demonstrating the mitigating effect of model capacity on adversarial vulnerability.
Figure 3: Empirical validation of theoretical predictions.
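
The quadratic dependence on $\rho$ described in the Figure 2 caption can be illustrated with a toy experiment. The sketch below is not the paper's experiment: it applies one particular shift (a translation of the task-vector distribution by a vector of norm $\rho$, which has Wasserstein distance exactly $\rho$ from the original for any order $p \ge 1$) rather than the worst case over the ball, it uses a ridge predictor as a stand-in for the Transformer, and it does not model the capacity $m$. The function names and all numeric settings are hypothetical.

```python
import numpy as np


def ridge_error(beta, X, y, lam):
    """Squared parameter error ||beta_hat - beta||^2 of the ridge estimate.

    For isotropic standard Gaussian queries this equals the expected
    squared prediction error on a fresh query.
    """
    d = X.shape[1]
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(np.sum((beta_hat - beta) ** 2))


def mean_shift_risk(rho, n_tasks=2000, d=5, n=10, lam=5.0, noise_std=0.1, seed=0):
    """Average ridge error when task vectors are translated by norm rho."""
    rng = np.random.default_rng(seed)  # same seed: identical draws for every rho
    direction = np.ones(d) / np.sqrt(d)  # unit-norm shift direction (assumed)
    total = 0.0
    for _ in range(n_tasks):
        beta = rng.normal(size=d) + rho * direction
        X = rng.normal(size=(n, d))
        y = X @ beta + noise_std * rng.normal(size=n)
        total += ridge_error(beta, X, y, lam)
    return total / n_tasks


risks = {rho: mean_shift_risk(rho) for rho in (0.0, 1.0, 2.0)}
for rho, risk in risks.items():
    print(f"rho={rho:.1f}  average risk={risk:.4f}")
```

Because the same random draws are reused for every radius, the error of each task is an exact quadratic polynomial in `rho`, so the averaged curve is convex in `rho` by construction in this toy setting.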

Theorems & Definitions (12)

Lemma 4.1: Ridge Regression Equivalence of Linear Transformers
Definition 4.2: Lipschitz Continuity of the Predictor
Lemma 4.3: Gradient Bound and Lipschitz Constant
Theorem 4.4: Worst-case Meta-Risk Upper Bound
Corollary 4.5: Safe Radius and Model Capacity
Corollary 4.6: Sample Complexity for Adversarial ICL
Corollary 4.7: Dual Role of Regularization Strength $\lambda_N$
proof
proof
proof
...and 2 more
