Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

Ta-Chung Chi; Ting-Han Fan; Alexander I. Rudnicky; Peter J. Ramadge

TL;DR

The paper investigates why length extrapolation works in transformer language models by examining the role of positional embeddings through a receptive-field lens. It introduces a cumulative gradient-based empirical receptive field tool and uses it to analyze ALiBi and windowed attention, showing their extrapolation depends on training length covering the empirical receptive field. To address limitations, the authors propose Sandwich, a parameter-free relative positional embedding derived from sinusoidal embeddings that can leverage information beyond the training length, with patterns similar to KERPLE and T5. The work links empirical receptive field concepts to practical design choices for extrapolatable transformers and highlights potential energy efficiency benefits, while noting remaining challenges such as recency bias and ethical considerations.

Abstract

Length extrapolation permits training a transformer language model on short sequences that preserves perplexities when tested on substantially longer sequences. A relative positional embedding design, ALiBi, has had the widest usage to date. We dissect ALiBi via the lens of receptive field analysis empowered by a novel cumulative normalized gradient tool. The concept of receptive field further allows us to modify the vanilla Sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence. Sandwich shares with KERPLE and T5 the same logarithmic decaying temporal bias pattern with learnable relative positional embeddings; these elucidate future extrapolatable positional embedding design.

Paper Structure (30 sections, 16 equations, 13 figures, 7 tables)

Introduction
Related Work
Length Extrapolation
Positional Embeddings
Windowed and Sparse Attention
Receptive Field
Background and Notations
Transformer Language Model
ALiBi
Windowed Attention
Evaluation of Length Extrapolation
ALiBi and Windowed Attention
Slope Shift (Shift all $h$ by $\Delta$)
Slope Equalization (Same $h$ for all heads)
Windowed Attention (Size $w$)
...and 15 more sections

Figures (13)

Figure 1: ALiBi. For a transformer language model with $H$ attention heads, $h$ takes the values $n\cdot\frac{8}{H}$ for $n \in \{1, \dots, H\}$. Left = self-attention matrix, right = temporal biases matrix.
Figure 2: Windowed Attention. This is the same design as Longformer (Beltagy et al., 2020). We limit the context window size to $w=2$ in this example. Left = self-attention matrix, right = temporal biases matrix.
Figure 3: We always evaluate the perplexities of the 5 tokens numbered from 1 to 5. The upper brackets represent $L_{ex}=5$. The lower brackets represent $L_{ex}=3$. This formulation ensures the same 5 tokens are always evaluated with different numbers of previous tokens.
Figure 4: Cumulative normalized gradient on ArXiv when predicting the next (2048-th) token.
Figure 5: Cumulative normalized gradient on ArXiv when predicting the next (2048-th) token.
...and 8 more figures
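
The temporal bias matrices described in the captions of Figures 1 and 2 can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes the standard ALiBi formulation, in which head $n$ adds a bias of $-2^{-h}(i-j)$ to the attention logit between query position $i$ and key position $j \le i$, with $h = n\cdot\frac{8}{H}$ as in the Figure 1 caption. It also assumes that a window of size $w$ lets a token see itself and the $w-1$ tokens before it. The function names `alibi_bias` and `windowed_bias` are hypothetical.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Causal ALiBi-style temporal biases, shape (num_heads, seq_len, seq_len).

    Assumption: the slope of head n is 2 ** (-h) with h = n * 8 / num_heads.
    """
    n = np.arange(1, num_heads + 1)
    h = n * 8.0 / num_heads                    # h values from the Figure 1 caption
    slopes = 2.0 ** (-h)                       # one slope per head
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]         # i - j (query index minus key index)
    bias = -slopes[:, None, None] * dist[None].astype(float)
    # Future positions (j > i) are masked out with -inf.
    return np.where(dist[None] >= 0, bias, -np.inf)

def windowed_bias(seq_len, w):
    """Windowed-attention biases, shape (seq_len, seq_len).

    Assumption: a token sees itself and the w - 1 preceding tokens.
    """
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]
    visible = (dist >= 0) & (dist < w)
    # 0 inside the window, -inf everywhere else.
    return np.where(visible, 0.0, -np.inf)

if __name__ == "__main__":
    print(alibi_bias(seq_len=4, num_heads=8)[0])   # head with the steepest slope
    print(windowed_bias(seq_len=4, w=2))
```

Either matrix would be added to the pre-softmax attention logits. Under these assumptions, a large slope or a small window both shrink the context a head can effectively see, which is the quantity the receptive-field analysis examines.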
