Thus Spake Long-Context Large Language Model

Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Ziwei He, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu

TL;DR

This survey maps the lifecycle of long-context LLMs across four intertwined dimensions: architecture, infrastructure, training, and evaluation. It synthesizes techniques spanning RoPE and alternatives for length extrapolation, multiple KV cache strategies, memory management, and novel architectures like SSM-Mamba and LSTM-RWKV. The authors catalog infrastructure breakthroughs for training and inference at extreme context lengths, and discuss long-context data quality, curation, and post-training strategies, including multi-modal extensions. They conclude with ten open questions to stimulate future work, aiming to guide researchers toward practical, scalable lifelong-context models with robust evaluation frameworks.

Abstract

Long context is an important topic in Natural Language Processing (NLP), running through the development of NLP architectures, and offers immense opportunities for Large Language Models (LLMs), giving LLMs lifelong learning potential akin to that of humans. Unfortunately, the pursuit of a long context is accompanied by numerous obstacles. Nevertheless, long context remains a core competitive advantage for LLMs. In the past two years, the context length of LLMs has achieved a breakthrough extension to millions of tokens. Moreover, research on long-context LLMs has expanded beyond length extrapolation to a comprehensive focus on architecture, infrastructure, training, and evaluation technologies. Inspired by the symphonic poem Thus Spake Zarathustra, we draw an analogy between the journey of extending the context of LLMs and the attempts of humans to transcend their mortality. In this survey, we illustrate how LLMs struggle between the tremendous need for a longer context and the equal need to accept that it is ultimately finite. To achieve this, we give a global picture of the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation, showcasing the full spectrum of long-context technologies. At the end of this survey, we present 10 unanswered questions currently faced by long-context LLMs. We hope this survey can serve as a systematic introduction to research on long-context LLMs. Video: https://www.bilibili.com/video/BV11h9AYoEYj. GitHub: https://github.com/OpenMOSS/Thus-Spake-Long-Context-LLM.

Paper Structure

This paper contains 122 sections, 4 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: An overview of Thus Spake Long-Context Large Language Model.
  • Figure 2: Long-context performance of various LLMs across multiple benchmarks: perplexity (PPL) [presstrain], NIAH [niah], and RULER [hsieh2024ruler]. The horizontal axis represents the release time, while the vertical axis indicates the effective context length achieved by the LLMs on the corresponding task. The line associated with each task represents the state-of-the-art performance at a given point in time.
  • Figure 3: An overview of length extrapolation of long-context LLMs.
  • Figure 4: An overview of KV cache optimization of long-context LLMs.
  • Figure 5: An overview of memory management of long-context LLMs.
  • ...and 4 more figures