Position Encoding with Random Float Sampling Enhances Length Generalization of Transformers

Atsushi Shimizu, Shohei Taniguchi, Yutaka Matsuo

TL;DR

This paper tackles the challenge of length generalization in Transformers by introducing Random Float Sampling (RFS), a position indexing method that samples continuous indices from a shared range during both training and inference. By replacing fixed discrete position indices with randomly drawn continuous ones, RFS reduces out-of-distribution issues when handling unseen input lengths and can be plugged into existing position encodings such as absolute sinusoidal, RoPE, and ALiBi. Empirical results demonstrate strong improvements on length generalization tasks and competitive zero-shot commonsense reasoning performance, with notable gains over traditional methods like simple extension or random integer sampling. The findings suggest that exposing the model to a diverse set of position distances during training enhances its ability to reason over longer contexts, offering a practical and deployment-friendly approach for robustness in language modeling and sequence tasks.

Abstract

Length generalization is the ability of language models to maintain performance on inputs longer than those seen during pretraining. In this work, we introduce a simple yet powerful position encoding (PE) strategy, Random Float Sampling (RFS), that generalizes well to lengths unseen during pretraining or fine-tuning. In particular, instead of selecting position indices from a predefined discrete set, RFS uses randomly sampled continuous values, thereby avoiding out-of-distribution (OOD) issues on unseen lengths by exposing the model to diverse indices during training. Since assigning indices to tokens is a common and fundamental procedure in widely used PEs, the advantage of RFS can easily be incorporated into, for instance, the absolute sinusoidal encoding, RoPE, and ALiBi. Experiments corroborate its effectiveness by showing that RFS results in superior performance in length generalization tasks as well as zero-shot commonsense reasoning benchmarks.
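The indexing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: uniform sampling, sorting the samples into ascending order, the helper names `sample_float_positions` and `sinusoidal_encoding`, and the default range bound `max_pos=2048.0` (borrowed from the range quoted in the Figure 5 caption) are all assumptions made for illustration. The sketch draws continuous position indices from one shared range regardless of sequence length and evaluates the standard sinusoidal encoding at those non-integer positions.

```python
import math
import random


def sample_float_positions(seq_len, max_pos=2048.0, rng=None):
    """Draw seq_len continuous position indices from [0, max_pos].

    Assumptions of this sketch (not confirmed by the source): indices are
    sampled uniformly and sorted so that token order is preserved.
    """
    if rng is None:
        rng = random.Random()
    return sorted(rng.uniform(0.0, max_pos) for _ in range(seq_len))


def sinusoidal_encoding(positions, dim, base=10000.0):
    """Standard sinusoidal encoding evaluated at arbitrary real-valued positions."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    rows = []
    for pos in positions:
        row = []
        for i in range(dim // 2):
            # Each frequency contributes one sine and one cosine channel.
            angle = pos / (base ** (2 * i / dim))
            row.extend((math.sin(angle), math.cos(angle)))
        rows.append(row)
    return rows


# A short "training-length" input and a longer "test-length" input both draw
# their indices from the same range, so no index falls outside that range.
rng = random.Random(0)
train_pos = sample_float_positions(8, rng=rng)
test_pos = sample_float_positions(32, rng=rng)
train_pe = sinusoidal_encoding(train_pos, dim=16)
test_pe = sinusoidal_encoding(test_pos, dim=16)
```

Because only the indices change, the same sampled positions could in principle be passed to other index-based encodings such as RoPE or ALiBi in place of the usual integer indices; how the paper does this in detail is not shown here.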

Paper Structure (21 sections, 3 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: An illustration of the existing and proposed position indexing strategies. RFS avoids the OOD issue for any context length $K$, which is unknown during training.
  • Figure 2: Results of different PEs on the copy task. The shaded area indicates the input lengths seen during training. RFS boosts performance significantly and outperforms NoPE. The experimental setup is described in Section \ref{subsec:lg}.
  • Figure 3: Results of length generalization tasks. The shaded area indicates the input lengths seen during training.
  • Figure 4: Results of RFS, position interpolation, and random integer sampling on the copy task. The shaded area indicates the input lengths seen during training.
  • Figure 5: The singular values of the additive sinusoidal position matrix. The dimensionality is 256. The position index range is set to $[0, 2048]$ in our method (bottom). With the simple extension, the longer the input gets, the higher the rank of the position matrix becomes. RFS effectively controls the rank no matter the input length.
  • ...and 2 more figures