A Controlled Study on Long Context Extension and Generalization in LLMs

Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush

TL;DR

The paper establishes a controlled protocol to compare long-context extension methods for LLMs using a fixed base model and data, revealing that exact-attention, continually finetuned approaches (notably NTK-based variants) generally outperform approximate methods and that perplexity closely tracks downstream performance for these setups. It demonstrates that while some methods can generalize to longer, unseen contexts, extrapolation beyond tens of thousands of tokens remains challenging and data-hungry. The work also highlights trade-offs between training requirements, inference efficiency, and long-range retrieval capabilities, and it provides transparent, open-source resources to catalyze further research. Overall, the study clarifies how to evaluate long-context extensions and which approaches are most promising for scaling contextual reasoning in LLMs.

Abstract

Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
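Since the abstract treats perplexity as the general-purpose indicator, it may help to recall how it is usually derived from per-token negative log-likelihoods (NLL). The sketch below assumes NLL values in nats and uses the standard definition, perplexity = exp(mean NLL). The function name and the sample values are illustrative assumptions, not taken from the paper or its codebase.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean NLL) for a list of per-token NLL values in nats."""
    if not token_nlls:
        raise ValueError("need at least one token NLL")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical values: a model that is uniformly unsure among 4 tokens
# at every position has NLL = ln(4) per token, so perplexity is about 4.
uniform_over_4 = [math.log(4)] * 8
print(perplexity(uniform_over_4))
```

Lower perplexity means the model assigns higher probability to the observed tokens. For long-context evaluation the same computation is simply applied to longer sequences.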

Paper Structure (37 sections, 12 equations, 5 figures, 16 tables)

Figures (5)

  • Figure 1: Needle in a Haystack evaluation. Green squares indicate a high retrieval success rate, the white dashed line denotes the longest examples seen during training or finetuning, and the Y-axis represents the distance to the retrieved target.
  • Figure 2: Many-shot in-context learning accuracy on TREC News.
  • Figure 3: Averaged negative log-likelihood of different models broken down by context position.
  • Figure 4: Perplexity and averaged downstream task accuracy for Needle in a Haystack, LongBench and RULER.
  • Figure 5: Needle in a Haystack evaluation. "NTK-64-2B" represents the NTK-64K model trained with 2B tokens. Green squares indicate a high retrieval success rate, the white dashed line denotes the longest examples seen during training or finetuning, and the Y-axis represents the distance to the retrieved target.
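
A breakdown like the one described for Figure 3 (averaged negative log-likelihood by context position) can be approximated by grouping per-token NLL values into position buckets and averaging within each bucket. The following is a minimal sketch under that assumption. The function name, the bucket size, and the sample values are hypothetical, and the paper's actual procedure may differ.

```python
def nll_by_position(nll_per_sequence, bucket_size):
    """Average per-token NLL within fixed-width position buckets.

    nll_per_sequence: list of sequences, each a list of per-token NLLs
    ordered by position. Sequences may have different lengths.
    Returns a list whose i-th entry is the mean NLL over positions
    [i * bucket_size, (i + 1) * bucket_size).
    """
    sums = {}
    counts = {}
    for seq in nll_per_sequence:
        for pos, nll in enumerate(seq):
            bucket = pos // bucket_size
            sums[bucket] = sums.get(bucket, 0.0) + nll
            counts[bucket] = counts.get(bucket, 0) + 1
    return [sums[b] / counts[b] for b in sorted(sums)]


# Hypothetical NLL values for two short sequences, bucketed in pairs.
example = [[4.0, 2.0, 1.0, 1.0], [2.0, 2.0, 1.0]]
print(nll_by_position(example, bucket_size=2))
```

If a model uses its context well, the averaged NLL is expected to fall, or at least not rise, in later buckets, since later tokens have more preceding text to condition on.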