Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning

Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo

TL;DR

The paper tackles long-context understanding in LLMs by introducing LongFinanceQA, a synthetic long-context QA dataset annotated with intermediate chain-of-thought (CoT) reasoning. It presents Property-based Agentic Inference (PAI), a three-step framework that generates reasoning-augmented answers, and demonstrates that fine-tuning a lightweight model with supervised CoT (LongPAI) substantially improves long-context performance. Empirical results on the Loong and ∞Bench benchmarks show gains over baselines for both PAI (as data annotator) and LongPAI (as a trained model), including results competitive with teacher models, along with efficiency advantages. The work emphasizes the importance of explicit intermediate reasoning for long-context tasks and provides a scalable approach to producing high-quality reasoning data for domain-specific applications.

Abstract

Recent advances in Large Language Models (LLMs) have enabled them to process increasingly long sequences, ranging from 2K to 2M tokens and even beyond. However, simply extending the input sequence length does not necessarily lead to effective long-context understanding. In this study, we integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate effective long-context understanding. To achieve this, we introduce LongFinanceQA, a synthetic dataset in the financial domain designed to improve long-context reasoning. Unlike existing long-context synthetic data, LongFinanceQA includes intermediate CoT reasoning before the final conclusion, which encourages LLMs to perform explicit reasoning, improving accuracy and interpretability in long-context understanding. To generate synthetic CoT reasoning, we propose Property-based Agentic Inference (PAI), an agentic framework that simulates human-like reasoning steps, including property extraction, retrieval, and summarization. We evaluate PAI's reasoning capabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark, where it outperforms standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 28.0% gain on Loong's financial subset.

Paper Structure

This paper contains 18 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: The hype of long-context large language models. The results shown are from the Loong benchmark, where green points refer to open-source LLMs and red points indicate closed-source LLMs. Our GPT-4o-mini w/ PAI stands out as the red open circle.
  • Figure 2: Overview of Property-based Agentic Inference (PAI), which consists of three stages. A Property Extraction Agent identifies key properties $\boldsymbol{p}_i$ from the given query $\mathbf{Q}$, where each property consists of a measurable metric and its corresponding subject. Given the selected properties, a Property-based Retrieval Agent transforms each property into a sub-query $\boldsymbol{q}_i$ and uses it to retrieve relevant content chunks from long documents, yielding intermediate findings. A Summarization Agent integrates these intermediate findings to generate a comprehensive conclusion $\mathbf{A}$. After PAI finishes, the outputs of the three agents are combined to produce reasoning-augmented answers. These augmented answers are the core contribution of LongFinanceQA.
  • Figure 3: Token length distribution of answers with and without CoT reasoning from Multi-Source QA pairs in the proposed LongFinanceQA dataset.
  • Figure 4: Function calling details of property extraction agent (top) and property-based retrieval (bottom).
  • Figure 5: Histogram of input token lengths in LongFinanceQA (left). Proportion of single-source and multi-source QA tasks in LongFinanceQA (right).
  • ...and 3 more figures
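
The three PAI stages described in the Figure 2 caption (property extraction, property-based retrieval, summarization) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Property`, `to_sub_query`, `retrieve`, and `run_pai` are hypothetical, the extraction and summarization agents (LLM calls in the paper) are passed in as plain callables and replaced here by toy stubs, and retrieval is approximated by simple word overlap, which is an assumption made only to keep the example self-contained.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Property:
    metric: str   # measurable metric, e.g. "revenue"
    subject: str  # subject the metric refers to, e.g. "Company A"


Finding = Tuple[Property, List[str]]  # a property and its retrieved chunks


def to_sub_query(prop: Property) -> str:
    # Stage 2a: turn a property into a sub-query (this wording is an assumption).
    return f"What is the {prop.metric} of {prop.subject}?"


def tokens(text: str) -> set:
    # Lowercased word tokens, punctuation dropped.
    return set(re.findall(r"\w+", text.lower()))


def retrieve(sub_query: str, chunks: List[str], top_k: int = 1) -> List[str]:
    # Stage 2b: toy word-overlap ranking, a stand-in for the retrieval agent.
    q = tokens(sub_query)
    ranked = sorted(chunks, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:top_k]


def run_pai(
    query: str,
    chunks: List[str],
    extract: Callable[[str], List[Property]],
    summarize: Callable[[str, List[Finding]], str],
) -> str:
    properties = extract(query)                                   # Stage 1
    findings = [(p, retrieve(to_sub_query(p), chunks)) for p in properties]  # Stage 2
    conclusion = summarize(query, findings)                       # Stage 3
    # Combine the outputs of all three stages into a reasoning-augmented answer.
    steps = [f"- {p.metric} of {p.subject}: {' '.join(hits)}" for p, hits in findings]
    return "\n".join(["Properties and findings:", *steps, f"Conclusion: {conclusion}"])


# Toy stubs standing in for the LLM-backed agents (hypothetical).
def toy_extract(query: str) -> List[Property]:
    return [Property("revenue", "Company A"), Property("revenue", "Company B")]


def toy_summarize(query: str, findings: List[Finding]) -> str:
    return f"Summary built from {len(findings)} findings."


chunks = [
    "Company A reported revenue of 10 million in 2023.",
    "Company B reported revenue of 7 million in 2023.",
]
answer = run_pai("Which company had higher revenue in 2023?", chunks, toy_extract, toy_summarize)
print(answer)
```

Passing the agents in as callables keeps the control flow of the three stages visible while leaving the model-dependent parts open; in a real setting each stub would be replaced by an LLM call, and the overlap-based `retrieve` by whatever retrieval mechanism is actually used.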