Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision

Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, Xun Wang, Lin Sun, Xiangzheng Zhang, Sujian Li

TL;DR

The paper tackles the challenge of long-context reasoning in large language models and investigates Chain-of-Thought (CoT) prompting, which shows promise but is underexplored for long contexts. It introduces LongRePS, a process-supervised framework that self-samples reasoning paths with explicit source citations, assesses CoT quality along answer correctness and process reliability (via source faithfulness and intrinsic consistency), and fine-tunes models on high-quality CoTs. Across MuSiQue and LongBench benchmarks, LongRePS yields substantial gains over outcome supervision, including +13.6 points for LLaMA-3.1-8B-Base and +3.8 for Qwen-2.5-7B-Base on MuSiQue, plus strong cross-domain improvements (+9.3/+8.1 on average). The results demonstrate that high-quality, ground-truth-grounded CoTs, learned through self-supervision with robust filtering, can significantly enhance long-context reasoning in a scalable manner.

Abstract

Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.
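The sample-then-filter idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ReasoningPath` type, the convention that cited excerpts appear in double quotes, the exact-match answer check, and the `consistency_check` callable (a stand-in for whatever judge scores intrinsic consistency) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReasoningPath:
    text: str    # full chain of thought, including quoted source excerpts
    answer: str  # final answer extracted from the path


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(s.lower().split())


def answer_correct(path: ReasoningPath, gold: str) -> bool:
    # Assumption: exact match after normalization; a real setup may use F1.
    return normalize(path.answer) == normalize(gold)


def source_faithful(path: ReasoningPath, context: str) -> bool:
    # Assumption: citations are written as "double-quoted" excerpts.
    quotes = re.findall(r'"([^"]+)"', path.text)
    ctx = normalize(context)
    # Require at least one citation, and every citation must occur in the context.
    return bool(quotes) and all(normalize(q) in ctx for q in quotes)


def select_paths(
    paths: List[ReasoningPath],
    context: str,
    gold: str,
    consistency_check: Callable[[ReasoningPath], bool],
) -> List[ReasoningPath]:
    """Keep only paths that pass all three criteria."""
    return [
        p
        for p in paths
        if answer_correct(p, gold)
        and source_faithful(p, context)
        and consistency_check(p)
    ]
```

The surviving paths would then serve as supervised fine-tuning targets. The consistency check is left as a callable so that a model-based judge or a simpler heuristic can be plugged in.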

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) Average gain with CoT prompting for open-source and proprietary models on long-context datasets across domains and length tiers. SQA, MQA, LICL, and Syn are short for Single-Document QA, Multi-Document QA, Long In-Context Learning, and Synthetic tasks, respectively. Short, Medium, and Long denote the length tiers (<32k, 32-96k, >96k). See Section \ref{sec:cot_effectiveness} for details. (b) Zero-shot majority voting results w.r.t. sampling rounds on MuSiQue, w/ and w/o CoT prompting.
  • Figure 2: Performance gain of CoT for synthetic (MNR3, S-NIAH) and real-world (SQA, MQA, LICL) long-context scenarios across all models. CoT particularly benefits proprietary and large-scale open-source models, and its effectiveness extends across most long-context scenarios, except for extremely easy retrieval tasks.
  • Figure 3: Our process-supervised framework LongRePS. We begin by sampling a diverse collection of $N$ reasoning paths from the model. A quality assessment procedure consisting of three criteria is then applied to these samples to select high-quality training samples, which are then used for supervised fine-tuning.
  • Figure 4: Impact of sampling size on model performance on MuSiQue.
  • Figure 5: The impact of CoTs of different quality on model performance. Notation follows Table \ref{tab:main_results}.
  • ...and 1 more figure
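Figure 1(b) reports majority voting over sampled answers. As a rough illustration of that aggregation step, here is a minimal Python sketch. The `majority_vote` helper and its normalization rule are assumptions made for illustration, not the paper's evaluation code.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer after light normalization.

    Ties are resolved in favor of the answer that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Assumption: lowercase and collapse whitespace before counting.
    normalized = [" ".join(a.lower().split()) for a in answers]
    return Counter(normalized).most_common(1)[0][0]
```

Calling `majority_vote` on the first k sampled answers for each question, with k increasing, gives one accuracy value per number of sampling rounds, which is the kind of curve the figure describes.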