LongSSM: On the Length Extension of State-space Models in Language Modelling

Shida Wang

TL;DR

This work examines the length-extension problem in language modeling with state-space models (SSMs). It identifies zero-hidden-state initialization as a key bottleneck and shows that length extension effectively behaves as polynomial extrapolation. The authors propose a simple initialization scheme using previous hidden states to convert extrapolation into interpolation, enabling monotone length extension even with short training sequences and reduced memory requirements. They demonstrate both theoretical insights and empirical gains, including scenarios where training context can be as short as $T=16$ yet extend to very long inference contexts, while also acknowledging stability limitations in some settings. The approach provides a practical path to efficient long-context modeling with SSMs, complementing existing long-context techniques and highlighting future work on stability for large models.

Abstract

In this paper, we investigate the length extension of state-space models (SSMs) in language modeling. Length extension involves training models on short sequences and testing them on longer ones. We show that state-space models trained with zero hidden-state initialization have difficulty with length extension. We explain this difficulty by pointing out that length extension is equivalent to polynomial extrapolation. Based on this theory, we propose a simple yet effective method, changing the hidden-state initialization scheme, to improve length extension. Moreover, our method shows that a long training sequence length is beneficial but not necessary for length extension. Changing the hidden-state initialization enables the efficient training of long-memory models with a smaller training context length.

Paper Structure (30 sections, 7 theorems, 31 equations, 10 figures, 2 tables)

This paper contains 30 sections, 7 theorems, 31 equations, 10 figures, 2 tables.

Key Result

Theorem 2.2

Assume the entropies of language across different sequence lengths are all finite. Consider autoregressive language modeling as the learning of a sequence of random variables $\{X_k\}_{k=1}^\infty$. The ideal autoregressive language model returns the next random variable $\mathbf{X}_{T+1}$ based o… Consider the entropy of this autoregressive language model… By monotonicity of entropy and shift-in…
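
The theorem statement above is truncated, so the following is only a hedged sketch of the kind of argument its last sentence points to, not the paper's proof. It assumes that the process $\{X_k\}$ is stationary (shift-invariant), which the truncated text does not confirm. Under that assumption, conditioning on a longer context cannot increase the entropy of the next token:

$$
H(X_{T+2} \mid X_1, \dots, X_{T+1}) \;\le\; H(X_{T+2} \mid X_2, \dots, X_{T+1}) \;=\; H(X_{T+1} \mid X_1, \dots, X_T).
$$

The inequality is the standard fact that conditioning reduces entropy (listed below as Theorem B.3), and the equality is where the assumed shift-invariance enters. Together they say that the ideal next-token entropy is non-increasing in the context length $T$.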

Figures (10)

  • Figure 1: Three types of length-extension capabilities.
  • Figure 2: Length extension performance of Mamba evaluated on the Pile dataset (Gao et al., 2020). The models are trained with a sequence length of 2048. Although perplexity remains finite for sequences up to 4096, it increases significantly for lengths beyond 8192.
  • Figure 3: Graphical demonstration of the difference between zero-initialized hidden states and previous-initialized hidden states (truncated backpropagation through time) in training.
  • Figure 4: Comparison of two hidden-state initialization methods on a 6-layer Mamba with 30M parameters. Both the zero-initialized and the previous-initialized models are trained with training sequence length $T=32$. The zero-initialized model has difficulty extrapolating beyond 1024, while the previous-initialized model extends up to $T=32768$. Although previous-initialized models achieve length extension on the unshuffled test dataset, they suffer from noisy information in the hidden states when the data is shuffled.
  • Figure 5: Length extension of models trained with different sequence lengths using previous-initialized hidden states. We train a 6-layer S5 (Smith et al., 2023) up to training length $T=32768$ and a 6-layer Mamba (Gu and Dao, 2023) up to training length $T=8192$. Mamba has a larger hidden-state dimension, so its maximum training length is smaller (on the same GPU). Training with sequence length $T=1024$ is slightly better than training with shorter or longer sequence lengths.
  • ...and 5 more figures
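
The difference between the two initialization schemes contrasted in Figures 3 and 4 can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a diagonal linear recurrence $h_t = a \odot h_{t-1} + b\,u_t$ rather than Mamba or S5, and the names `ssm_scan` and `run_in_chunks` are hypothetical. It shows only the forward pass; in training, the carried state would additionally be detached from the gradient graph, as in truncated backpropagation through time.

```python
import numpy as np

def ssm_scan(a, b, u, h0):
    """Run the toy recurrence h_t = a * h_{t-1} + b * u_t and return all states."""
    h = h0
    states = []
    for u_t in u:
        h = a * h + b * u_t
        states.append(h)
    return np.stack(states)

def run_in_chunks(a, b, u, chunk_len, carry_state):
    """Process u in consecutive chunks of length chunk_len.

    carry_state=True  : each chunk starts from the last state of the previous
                        chunk (previous-initialized hidden states).
    carry_state=False : each chunk starts from zeros (zero-initialized).
    """
    h = np.zeros_like(a)
    outputs = []
    for start in range(0, len(u), chunk_len):
        if not carry_state:
            h = np.zeros_like(a)  # forget everything seen before this chunk
        states = ssm_scan(a, b, u[start:start + chunk_len], h)
        outputs.append(states)
        h = states[-1]  # state handed to the next chunk when carry_state=True
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, size=4)   # stable diagonal transition (assumed)
b = rng.normal(size=4)
u = rng.normal(size=64)              # scalar input sequence of length 64

full = ssm_scan(a, b, u, np.zeros_like(a))
prev_init = run_in_chunks(a, b, u, chunk_len=16, carry_state=True)
zero_init = run_in_chunks(a, b, u, chunk_len=16, carry_state=False)

print("previous-init matches full sequence:", np.allclose(full, prev_init))
print("zero-init matches full sequence:    ", np.allclose(full, zero_init))
```

With carried states, short chunks reproduce the hidden states of the full-length sequence, so the model sees long-context states even though each chunk is short. With zero initialization, every chunk starts without any memory of earlier tokens.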

Theorems & Definitions (9)

  • Definition 2.1
  • Theorem 2.2: Existence of weak length extension in autoregressive language modeling
  • Theorem B.1: Information inequality
  • Corollary B.2: Nonnegativity of mutual information
  • Theorem B.3: Conditioning reduces entropy
  • Theorem B.4: Riesz-Markov-Kakutani representation theorem
  • Theorem C.1: Associativity of binary operation in state-space models
  • proof
  • Proposition C.2
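
Only the title of Theorem C.1 appears here, so the exact operation it refers to is not shown. As a hedged illustration, the sketch below assumes it is the operator commonly used in parallel scans for linear recurrences $h_t = a_t \odot h_{t-1} + b_t$, namely $(a_1, b_1) \bullet (a_2, b_2) = (a_2 a_1,\; a_2 b_1 + b_2)$. The function name `combine` is hypothetical. The code checks associativity numerically and confirms that the combined element reproduces the sequential recurrence.

```python
import numpy as np

def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first, then `right`."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

rng = np.random.default_rng(0)
e1, e2, e3 = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(3)]

# Associativity: (e1 . e2) . e3 == e1 . (e2 . e3)
left_first = combine(combine(e1, e2), e3)
right_first = combine(e1, combine(e2, e3))
is_associative = all(np.allclose(x, y) for x, y in zip(left_first, right_first))

# The combined element applied to h0 equals three sequential recurrence steps.
h0 = rng.normal(size=3)
h_seq = h0
for a, b in (e1, e2, e3):
    h_seq = a * h_seq + b
a_tot, b_tot = left_first
h_combined = a_tot * h0 + b_tot

print("associative:", is_associative)
print("matches sequential recurrence:", np.allclose(h_seq, h_combined))
```

Associativity is what allows the recurrence to be evaluated in any grouping, which is also why a hidden state carried over from an earlier segment composes cleanly with the steps of the next one.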