Wavelet-based Positional Representation for Long Context

Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito

TL;DR

Language models struggle to extrapolate to sequences longer than those seen in training, because their position embeddings cover only the positions encountered during training. The paper shows that RoPE corresponds to a restricted wavelet transform with a Haar-like wavelet at a single fixed scale, and that ALiBi behaves like windowed attention with multiple window sizes, which enables extrapolation but restricts the receptive field. It then introduces a wavelet-transform-based relative positional representation that uses multiple scales to capture non-stationary linguistic dynamics without limiting the attention field. Experiments on WikiText-103 and in long-context settings (e.g., with Llama-2 and on CodeParrot) report lower perplexity and better handling of long-range dependencies than RoPE, ALiBi, and Transformer-XL. The method thus offers a multi-scale positional encoding that leaves attention unrestricted while extrapolating to long contexts.

Abstract

In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model's attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field.
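Observation (2) can be illustrated with a small sketch. It assumes the standard ALiBi slope schedule for a power-of-two number of heads, $m_h = 2^{-8h/H}$, and treats a key at distance $d$ as effectively outside the window once the bias factor $\exp(-m_h d)$ drops below a cutoff. The cutoff of 0.001, the function names, and the choice of 8 heads are illustrative assumptions, not values taken from the paper's analysis.

```python
import math


def alibi_slopes(num_heads):
    """Per-head ALiBi slopes m_h = 2^(-8h / num_heads) for h = 1..num_heads.

    This is the geometric schedule used by ALiBi when num_heads is a power of two.
    """
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]


def effective_window(slope, threshold=1e-3):
    """Largest distance d with exp(-slope * d) >= threshold.

    The threshold is an illustrative cutoff (an assumption); content scores
    from the query-key dot product are ignored here.
    """
    return math.floor(math.log(1.0 / threshold) / slope)


slopes = alibi_slopes(8)
windows = [effective_window(m) for m in slopes]

for head, (m, w) in enumerate(zip(slopes, windows), start=1):
    print(f"head {head}: slope={m:.6f}, effective window ~ {w} tokens")
```

Under these assumptions each head halves its slope and roughly doubles its effective window, which is the sense in which ALiBi acts like windowed attention with several window sizes. Real attention scores also depend on the query-key dot product, so the actual reach of a head can differ from this estimate.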

Paper Structure

This paper contains 51 sections, 30 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of the wavelet-based relative positional representation. As in RPE (Shaw et al., 2018), our method computes a relative positional representation $(p_{m,n})^{T}$ for the query $q_{m}$ and the key $k_{n}$. Instead of the learnable embeddings used in RPE, the position is computed from a wavelet function. A different wavelet function $\psi_{a,b}$ is used for each dimension $d$ of the head, with the scale parameter $a$ and the shift parameter $b$ varying with $d$.
  • Figure 2: Heatmap of softmax-normalized scaled attention scores in ALiBi without non-overlapping inference. The vertical axis represents the query and the horizontal axis the key. For clarity, values of 0.001 or more are mapped to black, and values below that are mapped to yellow. The maximum sequence length in training is $L_{\rm train}=512$, and the inference length is $1012$.
  • Figure 3: Heatmap of softmax-normalized scaled attention scores in the 4th head without non-overlapping inference. The vertical axis represents the query and the horizontal axis the key. For clarity, values of 0.001 or more are mapped to black, and values below that are mapped to yellow. The maximum sequence length in pre-training is $L_{\rm train}=512$, and the inference length is $1012$. See the appendix for other heads.
  • Figure 4: Comparison of Ricker wavelet functions with scale parameters $a \in \{2^0, 2^1, 2^2, 2^3, 2^4\}$.
  • Figure 6: Comparison of wavelet functions, shown for scale parameter $a=2^4$ and shift parameter $b=0$.
  • ...and 5 more figures
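
The Ricker wavelets of Figure 4 and the per-dimension scale and shift parameters described in Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes the unnormalized Ricker form $\psi(t) = (1 - t^2)e^{-t^2/2}$ and the usual scaled and shifted family $\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)$. The function names and the way each $(a, b)$ pair is assigned to a dimension are hypothetical.

```python
import numpy as np


def ricker(t):
    """Unnormalized Ricker ("Mexican hat") mother wavelet (assumed form)."""
    return (1.0 - t**2) * np.exp(-(t**2) / 2.0)


def scaled_wavelet(t, a, b):
    """psi_{a,b}(t) = psi((t - b) / a) / sqrt(a), with scale a and shift b."""
    return ricker((t - b) / a) / np.sqrt(a)


def relative_position_table(max_distance, scales, shifts):
    """Table of wavelet values with shape (max_distance, len(scales)).

    Row t holds the values for relative distance t; column d uses the
    hypothetical per-dimension pair (scales[d], shifts[d]).
    """
    t = np.arange(max_distance, dtype=np.float64)[:, None]
    a = np.asarray(scales, dtype=np.float64)[None, :]
    b = np.asarray(shifts, dtype=np.float64)[None, :]
    return scaled_wavelet(t, a, b)


# Scales from the Figure 4 caption; zero shifts are an illustrative choice.
scales = [2.0**k for k in range(5)]
shifts = [0.0] * 5
table = relative_position_table(64, scales, shifts)
print(table.shape)
```

In this sketch, small scales respond only to nearby positions while large scales vary slowly over long distances, so the columns together cover several window sizes without masking any position.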