Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

TL;DR

The paper investigates why length extrapolation works in transformer language models by examining positional embeddings through a receptive-field lens. It introduces a cumulative-gradient-based tool for measuring the empirical receptive field and uses it to analyze ALiBi and windowed attention, showing that their extrapolation depends on the training length covering the empirical receptive field. To address this limitation, the authors propose Sandwich, a parameter-free relative positional embedding derived from sinusoidal embeddings that can leverage information beyond the training length, with a temporal bias pattern similar to that of KERPLE and T5. The work links the empirical receptive field to practical design choices for extrapolatable transformers and highlights potential energy-efficiency benefits, while noting remaining challenges such as recency bias and ethical considerations.

Abstract

Length extrapolation permits training a transformer language model on short sequences that preserves perplexities when tested on substantially longer sequences. A relative positional embedding design, ALiBi, has had the widest usage to date. We dissect ALiBi via the lens of receptive field analysis empowered by a novel cumulative normalized gradient tool. The concept of receptive field further allows us to modify the vanilla Sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence. Sandwich shares with KERPLE and T5 the same logarithmic decaying temporal bias pattern with learnable relative positional embeddings; these elucidate future extrapolatable positional embedding design.
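
The abstract says Sandwich is obtained by modifying the vanilla Sinusoidal positional embedding into a relative one. The identity that makes this possible can be checked numerically: the inner product of two sinusoidal embeddings depends only on the distance between their positions, because $\sin a \sin b + \cos a \cos b = \cos(a-b)$. The sketch below is a minimal illustration of that identity only, not the paper's exact Sandwich formulation; the function names, the base of 10000, and the embedding dimension are assumptions made for the example.

```python
import math


def sinusoidal_embedding(pos, dim, base=10000.0):
    """Vanilla sinusoidal embedding: one (sin, cos) pair per frequency.

    `base` and the pair layout follow the common convention; they are
    assumptions here, not values taken from this summary.
    """
    emb = []
    for i in range(dim // 2):
        angle = pos / (base ** (2 * i / dim))
        emb.extend([math.sin(angle), math.cos(angle)])
    return emb


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relative_bias(distance, dim, base=10000.0):
    """Closed form of the inner product: a sum of cosines of the distance."""
    return sum(
        math.cos(distance / (base ** (2 * i / dim))) for i in range(dim // 2)
    )


if __name__ == "__main__":
    dim = 16
    # Two position pairs with the same distance give the same inner product.
    near = dot(sinusoidal_embedding(7, dim), sinusoidal_embedding(3, dim))
    far = dot(sinusoidal_embedding(104, dim), sinusoidal_embedding(100, dim))
    print(near, far, relative_bias(4, dim))
```

Because the result is a function of the distance alone, it can be added to the attention logits as a temporal bias without any learned parameters, which is consistent with the "parameter-free" description above.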

Paper Structure (30 sections, 16 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: ALiBi. For a transformer language model with $H$ attention heads, $h$ takes the values $n\cdot\frac{8}{H}$ for $n \in \{1, \dots, H\}$. Left = self-attention matrix, right = temporal biases matrix.
  • Figure 2: Windowed Attention. This is the same design as Longformer (Beltagy et al., 2020). We limit the context window size to $w=2$ in this example. Left = self-attention matrix, right = temporal biases matrix.
  • Figure 3: We always evaluate the perplexities of the 5 tokens numbered from 1 to 5. The upper brackets represent $L_{ex}=5$. The lower brackets represent $L_{ex}=3$. This formulation ensures the same 5 tokens are always evaluated with different numbers of previous tokens.
  • Figure 4: Cumulative normalized gradient on ArXiv when predicting the next (2048-th) token.
  • Figure 5: Cumulative normalized gradient on ArXiv when predicting the next (2048-th) token.
  • ...and 8 more figures
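
The temporal biases matrices described in Figures 1 and 2 can be sketched in a few lines. The code below assumes the standard ALiBi convention that the head with exponent $h$ uses slope $2^{-h}$ and penalizes each key linearly in its distance from the query, and that windowed attention with window $w$ assigns a bias of 0 to the $w$ most recent positions and $-\infty$ to all others. The function names are hypothetical, and folding the causal mask into the bias matrix is a choice made for this illustration.

```python
import math


def alibi_slope(n, num_heads):
    """Slope of the n-th head (1-indexed), assuming slope = 2 ** (-h)
    with h = n * 8 / num_heads as in the Figure 1 caption."""
    h = n * 8 / num_heads
    return 2.0 ** (-h)


def alibi_bias(seq_len, slope):
    """Causal ALiBi bias: -slope * (query index - key index) for past keys.

    Future keys get -inf so the causal mask is part of the matrix
    (an assumption of this sketch).
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


def windowed_bias(seq_len, window):
    """Windowed attention bias: 0 inside the last `window` positions
    (including the query itself), -inf everywhere else."""
    return [
        [0.0 if 0 <= i - j < window else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for row in alibi_bias(4, alibi_slope(1, 8)):
        print(row)
    for row in windowed_bias(4, 2):
        print(row)
```

Under these assumptions, ALiBi down-weights distant tokens gradually while windowed attention cuts them off entirely, which is why both designs have a bounded receptive field that the training length must cover.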