Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han
TL;DR
This work examines how long-context capacity influences reasoning in large language models, a question that remains underexplored. Controlled experiments vary long-context pretraining while holding architecture and fine-tuning data constant, and show that stronger long-context ability yields higher reasoning accuracy after supervised fine-tuning, with benefits that extend to short-input tasks. Using RoPE theta scaling and model merging, the authors show that extending the effective context length to between $128K$ and $1M$ tokens can improve reasoning performance, and they propose a practical recipe: first extend long-context capacity, then apply reasoning-focused fine-tuning. The findings support treating long-context capacity as a first-class objective in future model design, with potential gains in reasoning accuracy and data efficiency on benchmarks such as MATH500, AIME, and GSM8K.
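To make the RoPE theta scaling mentioned above concrete, the following is a minimal sketch of how the rotary base (theta) sets the per-dimension rotation frequencies, and why a larger base stretches the range of positions the slowest-rotating dimensions can distinguish. It uses the standard RoPE frequency formula; the function names, the head dimension of 128, and the theta values are illustrative assumptions and are not taken from the paper's configuration.

```python
import math


def rope_inv_freq(head_dim: int, theta: float) -> list[float]:
    """Standard RoPE frequencies: theta ** (-2i / head_dim) for each dimension pair i."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, theta: float) -> float:
    """Token distance over which the slowest-rotating pair completes one full turn."""
    slowest = rope_inv_freq(head_dim, theta)[-1]
    return 2 * math.pi / slowest


# Hypothetical settings for illustration only (not the paper's values).
HEAD_DIM = 128
for theta in (1e4, 1e6, 1e8):
    wavelength = longest_wavelength(HEAD_DIM, theta)
    print(f"theta={theta:.0e}  longest wavelength ~ {wavelength:,.0f} tokens")
```

Raising theta lowers every frequency except the first, so the longest wavelength grows with the base. This is only the positional-encoding side of context extension; in practice the model is typically also trained further on long sequences after the base is changed.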
Abstract
Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity. This hypothesis is motivated by two empirical observations: (1) a longer context window often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test the hypothesis, we examine whether enhancing a model's long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compare models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not only essential for processing lengthy inputs but also serves as a critical foundation for reasoning. We advocate treating long-context capacity as a first-class objective in the design of future language models.
