Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge
TL;DR
The paper investigates why length extrapolation works in transformer language models by examining the role of positional embeddings through a receptive-field lens. It introduces a cumulative gradient-based empirical receptive field tool and uses it to analyze ALiBi and windowed attention, showing that their extrapolation depends on the training length covering the empirical receptive field. To address this limitation, the authors propose Sandwich, a parameter-free relative positional embedding derived from sinusoidal embeddings that can leverage information beyond the training length, with temporal bias patterns similar to those of KERPLE and T5. The work links empirical receptive field concepts to practical design choices for extrapolatable transformers and highlights potential energy efficiency benefits, while noting remaining challenges such as recency bias and ethical considerations.
Abstract
Length extrapolation permits training a transformer language model on short sequences while preserving perplexity when tested on substantially longer sequences. A relative positional embedding design, ALiBi, has had the widest usage to date. We dissect ALiBi via the lens of receptive field analysis empowered by a novel cumulative normalized gradient tool. The concept of receptive field further allows us to modify the vanilla Sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence. Sandwich shares the same logarithmically decaying temporal bias pattern as KERPLE and T5, which use learnable relative positional embeddings; these findings elucidate future extrapolatable positional embedding design.
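
To make concrete how a sinusoidal embedding can yield a parameter-free relative bias, the following is a minimal sketch. It assumes the standard sinusoidal embedding of the original Transformer (base 10000, sine and cosine halves) and is not necessarily the exact Sandwich formulation. It only checks the trigonometric identity that the inner product of two such embeddings depends on the positions solely through their distance m - n. The function names `sinusoidal_embedding`, `embedding_inner_product`, and `relative_bias` are hypothetical and chosen for illustration.

```python
import math


def sinusoidal_embedding(position, dim, base=10000.0):
    """Standard sinusoidal embedding: sine half followed by cosine half.

    Assumption: frequencies are base ** (-2i / dim) for i in [0, dim / 2).
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    freqs = [base ** (-2 * i / dim) for i in range(dim // 2)]
    sines = [math.sin(position * w) for w in freqs]
    cosines = [math.cos(position * w) for w in freqs]
    return sines + cosines


def embedding_inner_product(m, n, dim, base=10000.0):
    """Inner product of the embeddings at absolute positions m and n."""
    p_m = sinusoidal_embedding(m, dim, base)
    p_n = sinusoidal_embedding(n, dim, base)
    return sum(a * b for a, b in zip(p_m, p_n))


def relative_bias(distance, dim, base=10000.0):
    """Closed form of the same inner product: sum_i cos(distance * w_i).

    Follows from sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so the value
    depends only on the relative distance, with no learned parameters.
    """
    return sum(
        math.cos(distance * base ** (-2 * i / dim)) for i in range(dim // 2)
    )


if __name__ == "__main__":
    dim = 16
    # The same distance at different absolute offsets gives the same value.
    print(embedding_inner_product(7, 3, dim))
    print(embedding_inner_product(107, 103, dim))
    print(relative_bias(4, dim))
```

Because the closed form is defined for any distance, such a bias can in principle be evaluated at distances longer than those seen during training; whether this helps extrapolation in practice is the empirical question the paper addresses.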
