The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris Maddison, Daniel M. Roy
TL;DR
This work addresses the instability and rank degeneracy (rank collapse) observed in Softmax-based attention within deep Transformer architectures. It introduces the shaped attention mechanism, which centers the Softmax output at the identity and scales the logits by a width-dependent temperature, to stabilize forward and backward covariance propagation in the proportional infinite-depth-and-width limit. The authors derive neural covariance SDEs that characterize the distribution at initialization for shaped attention and shaped Transformer blocks, and demonstrate local convergence (in the Skorohod sense) to these SDEs, with explicit drift and diffusion terms encoding the influence of attention and residual connections. Simulations show that the SDE descriptions closely track finite-size networks, and preliminary experiments suggest that shaped Transformers can train with competitive stability and performance. Overall, the paper provides a tractable, non-commutative limiting theory for Transformer-like architectures, offering design principles and hyperparameter guidance for stable, scalable deep attention models.
Abstract
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
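The two modifications named in the abstract (centering the Softmax output at the identity, and dividing the logits by a width-dependent temperature) can be illustrated with a minimal NumPy sketch. The abstract does not state the formulas, so the specific choices below are assumptions made for illustration only: the centering is written as I + Softmax(logits / tau) - (1/m) 11^T, the temperature as tau = tau0 * sqrt(n * n_k), and the weights are drawn with variance 1/n. The names `shaped_attention_matrix`, `tau0`, `n`, `n_k`, and `m` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def shaped_attention_matrix(X, W_q, W_k, tau0=1.0):
    """Sketch of an identity-centered, temperature-scaled attention matrix.

    X   : (m, n) array of m token representations of width n.
    W_q : (n, n_k) query weights; W_k : (n, n_k) key weights.
    The formulas here are illustrative assumptions, not quoted from the paper.
    """
    m, n = X.shape
    n_k = W_q.shape[1]
    logits = (X @ W_q) @ (X @ W_k).T        # (m, m) query-key scores
    tau = tau0 * np.sqrt(n * n_k)           # assumed width-dependent temperature
    A = softmax(logits / tau, axis=-1)      # close to uniform when tau is large
    # Subtract the uniform matrix and add the identity (assumed centering).
    return np.eye(m) + A - np.ones((m, m)) / m

# Usage on random inputs (all sizes are arbitrary illustrative choices).
rng = np.random.default_rng(0)
m, n, n_k = 4, 64, 64
X = rng.standard_normal((m, n))
W_q = rng.standard_normal((n, n_k)) / np.sqrt(n)   # assumed 1/n variance
W_k = rng.standard_normal((n, n_k)) / np.sqrt(n)
A = shaped_attention_matrix(X, W_q, W_k)
A_large_tau = shaped_attention_matrix(X, W_q, W_k, tau0=1e6)
```

Under these assumptions each row of `A` still sums to one, and as the temperature grows the matrix approaches the identity, so a layer acts as a small perturbation of its input. That is the intuition behind the well-behaved covariance the abstract describes.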
