How Smooth Is Attention?
Valérie Castin, Pierre Ablin, Gabriel Peyré
TL;DR
This work analyzes the smoothness of Transformer self-attention through its local Lipschitz constant, in both unmasked and masked settings. It develops Euclidean and mean-field formulations and establishes bounds that scale as $\sqrt{n}$ and are tight for moderate sequence lengths $n$, and bounds independent of $n$ in the mean-field regime; a separate large-radius regime shows similar behavior for almost all inputs. It introduces a novel mean-field framework for masked self-attention, together with a conditional-transport distance that quantifies sequentially conditioned changes. Experiments on BERT and GPT-2 corroborate the theory: on real data the Lipschitz constant grows roughly as $n^{1/4}$, while adversarial constructions reach the worst-case $\sqrt{n}$ rate, with implications for the robustness and architectural design of Transformers.
Abstract
Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties, which are key when it comes to analyzing robustness and expressive power, is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
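To make the notion of a local Lipschitz constant concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a single-head parametrization $f(X) = \mathrm{softmax}(X A X^\top) X V$ with tokens as rows, and takes the local Lipschitz constant at $X$ to be the spectral norm of the Jacobian of $f$ (Frobenius norm on input and output sequences). The names `self_attention` and `local_lipschitz`, the matrices `A` and `V`, and the finite-difference step `eps` are illustrative choices.

```python
import numpy as np


def self_attention(X, A, V):
    """Single-head self-attention softmax(X A X^T) X V; rows of X are tokens.

    A (d x d) plays the role of the combined query-key matrix and V (d x d)
    the value matrix. This parametrization is an assumption of the sketch.
    """
    scores = X @ A @ X.T
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilize the softmax
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)  # each row sums to 1
    return P @ X @ V


def local_lipschitz(X, A, V, eps=1e-6):
    """Spectral norm of the Jacobian of self_attention at X.

    The Jacobian is approximated column by column with central finite
    differences, so the result is a numerical estimate.
    """
    n, d = X.shape
    out_size = self_attention(X, A, V).size
    J = np.zeros((out_size, n * d))
    for k in range(n * d):
        E = np.zeros(n * d)
        E[k] = eps
        E = E.reshape(n, d)  # perturb one coordinate of one token
        diff = self_attention(X + E, A, V) - self_attention(X - E, A, V)
        J[:, k] = diff.ravel() / (2 * eps)
    return np.linalg.norm(J, 2)  # largest singular value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 3
    A = rng.standard_normal((d, d)) / np.sqrt(d)
    V = np.eye(d)
    for n in (4, 8, 16):
        X = rng.standard_normal((n, d))
        print(n, local_lipschitz(X, A, V))
```

A quick sanity check of the sketch: with `A` set to zero the attention weights are uniform, so $f$ reduces to averaging the tokens and applying `V`, and with `V` the identity the estimate should be close to 1. Random Gaussian inputs as in the loop above are only a toy setting and are not expected to reproduce the growth rates reported for BERT and GPT-2.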
