LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo

TL;DR

LongFaith tackles faithfulness problems in synthetic long-context reasoning data by integrating ground-truth information with chain-of-citation prompts to generate faithful reasoning chains. It introduces LongFaith-SFT for supervised fine-tuning and LongFaith-PO for preference optimization, both open-sourced, to train LLMs on faithful long-context reasoning. Across MuSiQue, 2WikiMultiHopQA, HotpotQA, and LongBench, models trained with LongFaith data show improved long-context reasoning and QA performance, with LongFaith-PO delivering the strongest gains. Ablation studies highlight the importance of attribution-based reasoning, diverse faithfulness dimensions, and cross-LLM robustness, suggesting the approach generalizes to longer contexts and different model sizes.

Abstract

Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
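
The idea of combining ground truth with citation-based reasoning prompts can be illustrated with a small prompt builder. This is only a minimal sketch under stated assumptions: the `Passage` type, the function name `build_chain_of_citation_prompt`, and the prompt wording are hypothetical and are not taken from the LongFaith release. It shows one plausible way to number the context passages, expose the ground-truth supporting passages and answer to the synthesizing LLM, and ask for a reasoning chain that cites passage indices.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Passage:
    """A context passage (hypothetical container, not from the paper)."""
    title: str
    text: str


def build_chain_of_citation_prompt(
    question: str,
    passages: List[Passage],
    supporting_ids: Iterable[int],
    answer: str,
) -> str:
    """Assemble a prompt that gives the synthesizer the ground truth and
    asks for a reasoning chain with explicit [index] citations.

    The wording below is an illustrative assumption, not the paper's prompt.
    """
    lines = ["Documents:"]
    # Number passages from 1 so citations like [1] refer to them directly.
    for index, passage in enumerate(passages, start=1):
        lines.append(f"[{index}] {passage.title}: {passage.text}")

    # Ground-truth guidance: which passages matter and what the answer is.
    cited = ", ".join(f"[{i}]" for i in sorted(supporting_ids))
    lines += [
        "",
        f"Question: {question}",
        f"Ground-truth supporting documents: {cited}",
        f"Ground-truth answer: {answer}",
        "",
        "Reason step by step using only the supporting documents. "
        "Cite the document index, e.g. [1], after every claim, "
        "then finish with 'Answer: <answer>'.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [
        Passage("Film A", "Film A was directed by Jane Doe."),
        Passage("Unrelated", "A distractor passage."),
        Passage("Jane Doe", "Jane Doe was born in Oslo."),
    ]
    print(build_chain_of_citation_prompt(
        "Where was the director of Film A born?", docs, [3, 1], "Oslo"))
```

Passing the supporting indices alongside the answer is what lets the synthesizer ignore distractor passages, which is the intuition behind skipping a separate verification step; the actual dataset construction in the paper may differ in its details.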

Paper Structure

This paper contains 32 sections, 9 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: A brief introduction of LongFaith. Synthesized long-context reasoning instruction sets and preference datasets are fed into the post-training stage of downstream LLMs.
  • Figure 2: Overview of the LongFaith pipeline for synthesizing faithful long-context reasoning instruction and preference datasets. In contrast to generated reasoning chains that suffer from misinformation, lack of attribution, and knowledge conflicts, LongFaith combines ground-truth guidance with chain-of-citation prompting to build LongFaith-SFT. Fine-grained faithfulness is modeled through preference optimization on the LongFaith-PO datasets.
  • Figure 3: Performance of Llama-3.1-8B-Instruct trained on instruction sets of different sizes (1K to 8K) synthesized by Qwen2.5-7B-Instruct, evaluated with EM and F1 on the multi-hop reasoning sets and LongBench.
  • Figure 4: Scatter plot with a fitted linear regression line showing the relationship between QA EM and attribution F1 on three long-context multi-hop reasoning test sets. Each point is the performance of a model trained by SFT or PO on a dataset of a specific size between 1K and 8K.
  • Figure 5: Visualization of the F1 scores in the LongBench experiments table (tab:longbench_exp).
  • ...and 6 more figures
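
The EM and F1 metrics named in the Figure 3 caption are commonly computed as normalized exact match and token-overlap F1. The sketch below follows that common convention as an assumption; the normalization rules (lowercasing, stripping punctuation and English articles) and the function names are illustrative and may not match the paper's evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower.", "eiffel tower"))
    print(f1_score("Paris France", "Paris"))
```

Dataset-level scores are then typically the mean of these per-example values, optionally taking the maximum over several gold answers per question.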