Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han

TL;DR

This work tackles the underexplored question of how long-context capacity influences reasoning in large language models. It employs controlled experiments that vary long-context pretraining while keeping architecture and fine-tuning data constant, and demonstrates that stronger long-context ability yields higher reasoning accuracy after supervised fine-tuning, with benefits extending to short-input tasks. Through RoPE theta scaling and model merging, the authors show that extending the effective context length to $128K$–$1M$ tokens can improve reasoning performance, and they propose a practical recipe: first extend long-context capacity, then apply reasoning-focused fine-tuning. The findings advocate treating long-context capacity as a first-class objective in future model design, with potential broad impact on reasoning tasks and data efficiency across benchmarks like MATH500, AIME, and GSM8K.

Abstract

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) a longer context window often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compare models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not only essential for processing lengthy inputs but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models.

Paper Structure

This paper contains 14 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Impact of long-context capacity on mathematical reasoning. Left: Accuracy (Pass@1) on MATH500 and AIME datasets for public models with 32k and 128k context lengths, showing consistent improvements in reasoning performance with longer context windows. The 32k and 128k LLMs refer to three different public models, as listed in the paper's performance comparison table. Right: Reasoning accuracy versus RoPE theta values, highlighting a strong correlation between long-context capacity and reasoning performance. Increasing the RoPE theta value typically extends the effective context window length.
  • Figure 2: Case Study: Repetition Failure. Two failure cases where the model produces clearly repetitive sentences in its answers. Such repetition is a common symptom of insufficient long-context capability, leading to strange responses and degraded reasoning quality in extended sequences.
  • Figure 3: Case Study: Contextual Reference Failures. Two failure cases where the model makes incorrect references to expressions introduced earlier in the problem. These errors occur in the later stages of the response and reflect a typical symptom of insufficient long-context capability.
  • Figure 4: Top: Length distribution of three reasoning datasets. NuminaMath-CoT represents early chain-of-thought (CoT) data with short sequences, while OpenR1-Math-220K and DeepMath-103K, generated by DeepSeek-R1, exhibit significantly longer outputs. Bottom-left: Performance of DeepSeek-Distilled-Qwen-1.5B on the Needle-in-a-Haystack benchmark with 32K context. Bottom-middle/right: Average output lengths of correct and incorrect generations on AIME and MATH500 for DeepSeek-Distilled-Qwen-1.5B, 7B, and 14B. Incorrect answers consistently exhibit longer output lengths, indicating potential limitations of long-context ability in reasoning.
  • Figure 5: Needle-in-a-Haystack Results for LLaMA-3-8B-Instruct. Performance of LLaMA-3-8B-Instruct on the Needle-in-a-Haystack benchmark with a 32K context under different RoPE theta scaling factors. RoPE theta x 16 refers to scaling the original RoPE theta by a factor of 16.
  • ...and 9 more figures
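
The captions for Figures 1 and 5 refer to scaling the RoPE theta (the base of the rotary position embedding) by a constant factor. The sketch below illustrates what that scaling does to the rotation frequencies, using the standard RoPE formula in which dimension pair i rotates at frequency theta^(-2i/d). It is not the authors' code: the function names are hypothetical, and the base theta of 500000 and head dimension of 128 are assumed values chosen to resemble a LLaMA-3-8B-style configuration.

```python
import math

def rope_inv_freq(head_dim, theta):
    """Rotation frequency theta^(-2i/d) for each dimension pair i = 0 .. d/2 - 1."""
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim, theta):
    """Wavelength, in tokens, of the slowest-rotating dimension pair."""
    slowest = rope_inv_freq(head_dim, theta)[-1]
    return 2.0 * math.pi / slowest

# Assumed values for illustration only (not taken from the paper).
HEAD_DIM = 128
BASE_THETA = 500000.0
SCALE = 16  # "RoPE theta x 16" in the Figure 5 caption

base_wavelength = longest_wavelength(HEAD_DIM, BASE_THETA)
scaled_wavelength = longest_wavelength(HEAD_DIM, BASE_THETA * SCALE)

print(f"longest wavelength at base theta:   {base_wavelength:.0f} tokens")
print(f"longest wavelength at scaled theta: {scaled_wavelength:.0f} tokens")
print(f"ratio: {scaled_wavelength / base_wavelength:.2f}")
```

Raising theta leaves the fastest-rotating pair unchanged (its frequency is always 1) and slows the remaining pairs, so the low-frequency dimensions take more tokens to complete a full rotation. This is one common intuition for why a larger theta tends to extend the effective context window, consistent with the Figure 1 caption.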