The Lipschitz Constant of Self-Attention

Hyunjik Kim; George Papamakarios; Andriy Mnih

TL;DR

This work analyzes the Lipschitz properties of self-attention and shows that standard dot-product multi-head attention is not Lipschitz on unbounded input domains. It then introduces L2 self-attention, derives explicit upper bounds on its Lipschitz constant in both the $\ell_{\infty}$ and $\ell_2$ norms, and provides empirical evidence that these bounds are asymptotically tight. Building on these results, the authors formulate invertible self-attention by constraining the attention mechanism (Contractive-L2-MHA) and validate it within a Transformer-style architecture on a character-level language modelling task, noting improved training stability. The results suggest practical ways to impose Lipschitz constraints on Transformer models, with applications such as invertible residual networks, Neural ODEs, and density estimation, and the authors outline future directions including alternative kernels and bounded-input strategies.

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for an unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
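The claim that dot-product self-attention is not Lipschitz on an unbounded domain can be illustrated numerically. The following is a minimal sketch, not the paper's proof or code. It assumes a single head with identity query, key, and value projections, and a hand-picked input of three one-dimensional tokens (0, c, -c). The helper names `dot_product_self_attention` and `local_slope` are hypothetical. The sketch perturbs the zero token slightly and measures the ratio ||f(X + d) - f(X)|| / ||d||, which any Lipschitz constant would have to upper-bound.

```python
import numpy as np


def dot_product_self_attention(X):
    """Single-head dot-product self-attention.

    Assumption for illustration: the query, key, and value projections
    are all the identity, so the output is softmax(X X^T / sqrt(D)) X.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    # Subtract the row-wise max so the softmax stays numerically stable.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ X


def local_slope(c, eps=1e-6):
    """Ratio ||f(X + d) - f(X)|| / ||d|| at X = (0, c, -c).

    Only the first token (the one equal to zero) is perturbed, by eps.
    """
    X = np.array([[0.0], [c], [-c]])
    delta = np.zeros_like(X)
    delta[0, 0] = eps
    diff = dot_product_self_attention(X + delta) - dot_product_self_attention(X)
    return np.linalg.norm(diff) / np.linalg.norm(delta)


scales = [1.0, 10.0, 100.0]
slopes = [local_slope(c) for c in scales]
for c, s in zip(scales, slopes):
    print(f"c = {c:6.1f}  ->  ||f(X + d) - f(X)|| / ||d|| = {s:.1f}")
```

In this toy setup the zero token attends uniformly to all tokens, so the sensitivity of its output depends on the spread of the other tokens, and the measured ratio grows roughly quadratically with c. Since c can be made arbitrarily large on an unbounded domain, no finite constant bounds the ratio, which is consistent with the non-Lipschitz result stated in the abstract.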
