Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse

Josh Alman; Zhao Song

TL;DR

This work proves that tiny per-layer attention weights induce a layer-collapse phenomenon in Self-Attention Networks, effectively reducing an L-layer transformer to a single-layer surrogate and making the quadratic-time attention computation unavoidable for expressive models. By developing perturbation bounds for the Res operator, exponential, and softmax, the authors connect layer outputs to rank-like structure and show that skip connections do not avert collapse under small weights. The central result is that for weights with $\|W_q\|_{\infty}, \|W_k\|_{\infty}, \|W_v\|_{\infty} \leq \eta$, there exists a one-layer network $S'$ with $\|S(X) - S'(X)\|_{\infty} \leq O(\eta) \cdot \|X\|_{\infty}$ for all inputs $X$, and this bound can be iteratively applied to collapse the entire network. Consequently, large weights (not skip connections) are essential to avoid layer- and rank-collapse, with implications for Transformer expressivity and design, particularly regarding weight scaling and regularization. The results illuminate fundamental efficiency–expressivity trade-offs in attention mechanisms and underscore the limits of subquadratic attention for highly expressive models.
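The softmax perturbation step mentioned above can be illustrated numerically. The following is a minimal sketch, not the paper's construction or its actual bound: it assumes attention logits are entrywise bounded by a small value `eps` (a stand-in for what small query and key weights would produce on bounded inputs) and checks that the resulting softmax row stays within the elementary bound $(e^{2\epsilon} - 1)/n$ of the uniform distribution. The names `softmax`, `max_gap_from_uniform`, and `eps` are illustrative and do not come from the paper.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def max_gap_from_uniform(n, eps, seed=0):
    """Largest entrywise distance between softmax(z) and the uniform
    distribution 1/n, for random logits z with |z_i| <= eps."""
    rng = random.Random(seed)
    z = [rng.uniform(-eps, eps) for _ in range(n)]
    p = softmax(z)
    return max(abs(pi - 1.0 / n) for pi in p)


n = 16  # hypothetical sequence length
gaps = {eps: max_gap_from_uniform(n, eps) for eps in (1.0, 0.1, 0.01)}

for eps, gap in gaps.items():
    # Elementary bound: each softmax entry lies in [e^{-2 eps}/n, e^{2 eps}/n].
    bound = (math.exp(2 * eps) - 1) / n
    print(f"eps={eps:<5} gap={gap:.2e} bound={bound:.2e}")
```

As `eps` shrinks, the attention row becomes nearly indistinguishable from plain averaging over the sequence, which gives some intuition for why a layer with small weights contributes little beyond a fixed, input-independent mixing step.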

Abstract

Attention mechanisms lie at the heart of modern large language models (LLMs). Straightforward algorithms for forward and backward (gradient) computation take quadratic time, and a line of work initiated by [Alman and Song NeurIPS 2023] and [Alman and Song NeurIPS 2024] has shown that quadratic time is necessary unless the model weights are small, in which case almost linear time algorithms are possible. In this paper, we show that large weights are necessary to avoid a strong preclusion to representational strength we call layer collapse, which means that the entire network can be approximated well by a network with only a single layer. Thus, the quadratic running time of attention is unavoidable for expressive transformers. The notion of layer collapse that we introduce is a variant on the notion of rank collapse from the work of [Dong, Cordonnier, and Loukas ICML 2021]. They showed that in Self-Attention Networks with small weights and without skip connections, rank collapse must occur. This is typically interpreted as justifying the necessity of skip connections in expressive networks. However, our result shows that even with skip connections, if the weights are small, then layer collapse still occurs. Thus, only large weights, and not skip connections, can prevent these representational weaknesses.
