LongSSM: On the Length Extension of State-space Models in Language Modelling
Shida Wang
TL;DR
This work examines the length-extension problem in language modeling with state-space models (SSMs). It identifies zero-hidden-state initialization as a key bottleneck and shows that length extension effectively behaves as polynomial extrapolation. The authors propose a simple initialization scheme using previous hidden states to convert extrapolation into interpolation, enabling monotone length extension even with short training sequences and reduced memory requirements. They demonstrate both theoretical insights and empirical gains, including scenarios where training context can be as short as $T=16$ yet extend to very long inference contexts, while also acknowledging stability limitations in some settings. The approach provides a practical path to efficient long-context modeling with SSMs, complementing existing long-context techniques and highlighting future work on stability for large models.
Abstract
In this paper, we investigate the length extension of state-space models (SSMs) in language modeling. Length extension means training models on short sequences and testing them on longer ones. We show that state-space models trained with zero hidden-state initialization have difficulty with length extension. We explain this difficulty by showing that length extension is equivalent to polynomial extrapolation. Based on this theory, we propose a simple yet effective method, changing the hidden-state initialization scheme, to improve length extension. Moreover, our method shows that a long training sequence length is beneficial but not necessary for length extension. Changing the hidden-state initialization enables efficient training of long-memory models with a shorter training context length.
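To make the role of hidden-state initialization concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy diagonal linear SSM with recurrence $h_{k+1} = \lambda \odot h_k + B x_k$ and readout $y_k = C^\top h_{k+1}$; the names `run_ssm`, `run_chunked`, `lam`, `B`, `C`, and `chunk` are hypothetical and chosen only for illustration. The sketch processes a length-64 sequence in chunks of 16, once resetting the hidden state to zero at every chunk and once initializing each chunk with the previous chunk's final hidden state.

```python
import numpy as np

def run_ssm(x, lam, B, C, h0=None):
    """Toy diagonal linear SSM (an assumed form, for illustration only):
    h_{k+1} = lam * h_k + B * x_k,  y_k = C . h_{k+1}."""
    h = np.zeros_like(lam) if h0 is None else h0.copy()
    ys = []
    for x_k in x:
        h = lam * h + B * x_k
        ys.append(C @ h)
    return np.array(ys), h

rng = np.random.default_rng(0)
n, T, chunk = 4, 64, 16             # state size, full length, short context
lam = rng.uniform(0.5, 0.99, size=n)  # stable diagonal recurrent weights
B = rng.normal(size=n)
C = rng.normal(size=n)
x = rng.normal(size=T)

# Reference: one pass over the full sequence from a zero hidden state.
y_full, _ = run_ssm(x, lam, B, C)

def run_chunked(x, chunk, carry):
    """Process x in chunks; if carry is True, each chunk starts from the
    previous chunk's final hidden state, otherwise from zeros."""
    h, out = None, []
    for start in range(0, len(x), chunk):
        y, h_last = run_ssm(x[start:start + chunk], lam, B, C, h0=h)
        out.append(y)
        h = h_last if carry else None
    return np.concatenate(out)

y_carry = run_chunked(x, chunk, carry=True)   # previous-hidden-state init
y_zero = run_chunked(x, chunk, carry=False)   # zero init at every chunk

print("carry matches full pass:", np.allclose(y_carry, y_full))
print("max error with zero init:", np.abs(y_zero - y_full).max())
```

In this toy setting, carrying the hidden state lets a model that only ever sees 16 steps at a time reproduce the outputs of the full-length pass, whereas the zero-initialized variant discards everything before each chunk boundary. How the carried state is handled during training (for example, whether gradients flow through it) is not specified in this section, so the sketch covers only the forward recurrence.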
