Provable Adversarial Robustness in In-Context Learning
Di Zhang
TL;DR
This work develops a distributionally robust framework for in-context learning (ICL) under adversarial distribution shifts, using Wasserstein distributionally robust optimization (DRO). By analyzing a tractable linear self-attention Transformer, it derives a non-asymptotic upper bound on the worst-case meta-risk that decomposes into a nominal risk plus perturbation terms scaling with the shift radius $\rho$, the model capacity $m$, and the number of in-context examples $N$. The key findings are that the robust radius scales as $\rho_{\max} \propto \sqrt{m}$ and that the adversarial sample complexity penalty grows as $N_\rho - N_0 \propto \rho^2$; experiments on synthetic tasks confirm these scaling laws. The results formalize the sense in which model capacity is a fundamental resource for distributional robustness in ICL, and they offer guidance for robustness-aware design and evaluation, including the limits of post-hoc alignment and practical deployment strategies.
Abstract
Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($\rho$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($\rho_{\max} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_\rho - N_0 \propto \rho^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.
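
To make the two scaling laws concrete, the following is a minimal Python sketch of what the proportionalities imply when a single quantity is varied. It does not reproduce the paper's bound: the function names and the proportionality constants `c` and `k` are hypothetical placeholders introduced only for illustration, and only the ratios (which do not depend on those constants) carry meaning.

```python
import math


def robust_radius(m, c=1.0):
    """Robust radius under the stated law rho_max ∝ sqrt(m).

    `c` is a hypothetical proportionality constant, not a value from the paper.
    """
    return c * math.sqrt(m)


def extra_examples(rho, k=1.0):
    """Sample complexity penalty under the stated law N_rho - N_0 ∝ rho^2.

    `k` is a hypothetical proportionality constant, not a value from the paper.
    """
    return k * rho ** 2


# Quadrupling the capacity m doubles the robust radius.
ratio_radius = robust_radius(4 * 64) / robust_radius(64)

# Doubling the perturbation strength rho quadruples the extra examples needed.
ratio_examples = extra_examples(2 * 0.5) / extra_examples(0.5)

print(f"rho_max ratio when m is quadrupled: {ratio_radius}")
print(f"(N_rho - N_0) ratio when rho is doubled: {ratio_examples}")
```

Read this way, capacity buys robustness slowly (a square-root return), whereas the in-context examples needed to compensate for a stronger shift grow quadratically.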
