Extending Input Contexts of Language Models through Training on Segmented Sequences

Petros Karypis, Julian McAuley, George Karypis

TL;DR

This work develops a training procedure that extends the input context size of pretrained models with no architectural changes and no memory cost beyond that of training on the original input lengths. It does so by sub-sampling segments from long inputs while preserving their original positions.

Abstract

Effectively training language models on long inputs poses many technical challenges. To limit cost, language models are pretrained on a fixed sequence length before being adapted to longer sequences. We explore various methods for adapting models to longer inputs by training on segmented sequences, along with an interpolation-based method for extending absolute positional embeddings. We develop a training procedure that extends the input context size of pretrained models with no architectural changes and no memory cost beyond that of training on the original input lengths. By sub-sampling segments from long inputs while preserving their original positions, the model is able to learn new positional interactions. Our method benefits models trained with absolute positional embeddings, by extending their input contexts, as well as models using popular relative positional embedding methods, which show reduced perplexity on sequences longer than those they were trained on. We demonstrate that our method can extend input contexts by a factor of 4x while improving perplexity.
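
The segment sub-sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sample_segments`, the equal-length segments, and the uniform choice of segment offsets are all assumptions made for illustration. The sketch only shows the core mechanism, which is that a long input is reduced to a fixed token budget while each kept token carries its position in the original sequence.

```python
import random


def sample_segments(tokens, budget, num_segments, rng=None):
    """Keep `budget` tokens from `tokens` as `num_segments` contiguous,
    non-overlapping segments, and return them with their ORIGINAL positions.

    Hypothetical helper: equal-length segments and uniform offsets are
    illustrative assumptions, not the paper's exact sampling scheme.
    """
    rng = rng or random.Random()
    n = len(tokens)
    if budget > n:
        raise ValueError("budget cannot exceed the input length")
    if num_segments < 1 or budget % num_segments != 0:
        raise ValueError("budget must be divisible by num_segments")

    seg_len = budget // num_segments
    slack = n - budget  # number of tokens that will be skipped

    # Sorted offsets into the slack, shifted by the length of the segments
    # already placed, yield ordered, non-overlapping segment starts.
    offsets = sorted(rng.randint(0, slack) for _ in range(num_segments))
    starts = [off + i * seg_len for i, off in enumerate(offsets)]

    # Position ids are indices into the long input, not 0..budget-1.
    positions = [p for s in starts for p in range(s, s + seg_len)]
    kept = [tokens[p] for p in positions]
    return kept, positions


if __name__ == "__main__":
    long_input = list(range(1000, 1000 + 2048))  # stand-in token ids
    kept, positions = sample_segments(long_input, budget=512, num_segments=4,
                                      rng=random.Random(0))
    print(len(kept), positions[0], positions[-1])
```

In this sketch the model would see only 512 tokens per example, the same as at the original training length, but the position ids passed alongside them can range over all 2048 positions of the long input.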

Paper Structure (27 sections, 5 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Visualization of our various segment-based methods. We sub-sample tokens from the original sequence while maintaining their original positions.
  • Figure 2: Perplexity of "out-of-the-box" extrapolation. With interpolation of the positional embeddings, absolute positional embeddings (APE) extrapolate as well as ALiBi.
  • Figure 3: Histogram of median attention weights for positions past the original input length before and after our segmented training on models with RoPE. After adaptation, the distribution of attention weights becomes more uniform.
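
The interpolation of absolute positional embeddings mentioned in the abstract and in Figure 2 is not specified in this summary. As a rough sketch under stated assumptions, the Python below stretches a learned position table to a longer length with plain linear interpolation between neighboring rows. The function name `interpolate_position_table` and the choice of linear interpolation with aligned endpoints are illustrative assumptions, not the paper's confirmed method.

```python
def interpolate_position_table(table, new_len):
    """Stretch a positional-embedding table (list of equal-length rows)
    from len(table) rows to `new_len` rows by linear interpolation.

    Hypothetical helper: linear interpolation with the first and last rows
    pinned to the originals is an assumption made for illustration.
    """
    old_len = len(table)
    if old_len < 2 or new_len < 2:
        raise ValueError("both lengths must be at least 2")

    out = []
    for i in range(new_len):
        # Map new position i onto the old, fractional position axis.
        x = i * (old_len - 1) / (new_len - 1)
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        w = x - lo  # weight of the upper neighbor
        out.append([(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])])
    return out


if __name__ == "__main__":
    toy_table = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]  # 3 positions, dim 2
    for row in interpolate_position_table(toy_table, 5):
        print(row)
```

The extended table keeps the original embeddings at the two endpoints and fills the new positions with blends of their nearest original neighbors, so no new parameters need to be initialized from scratch.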