LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo
TL;DR
LongFaith tackles faithfulness problems in synthetic long-context reasoning data by integrating ground-truth information with chain-of-citation prompts to generate faithful reasoning chains. It introduces LongFaith-SFT for supervised fine-tuning and LongFaith-PO for preference optimization, both open-sourced, to train LLMs on faithful long-context reasoning. Across MuSiQue, 2WikiMultiHopQA, HotpotQA, and LongBench, models trained with LongFaith data show improved long-context reasoning and QA performance, with LongFaith-PO delivering the strongest gains. Ablation studies highlight the importance of attribution-based reasoning, diverse faithfulness dimensions, and cross-LLM robustness, suggesting the approach generalizes to longer contexts and different model sizes.
Abstract
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by a lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets achieve significantly better performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
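To make the idea of combining ground truth with citation-based prompting more concrete, the following is a minimal illustrative sketch and not the authors' implementation. The prompt wording, the `Passage` type, and the functions `build_citation_prompt` and `make_preference_pair` are all hypothetical names introduced here. The record layout with `prompt`, `chosen`, and `rejected` keys is likewise an assumption about how a preference pair might be stored, not the format of the released LongFaith-PO data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Passage:
    """One context passage (hypothetical container, not from the paper)."""
    title: str
    text: str


def build_citation_prompt(
    question: str,
    passages: List[Passage],
    answer: Optional[str] = None,
    supporting_ids: Optional[List[int]] = None,
) -> str:
    """Assemble a synthesis prompt that numbers passages and asks for citations.

    If `answer` or `supporting_ids` are given, they are included as
    ground-truth hints so the generated chain can stay anchored to them.
    The exact wording is an assumption made for illustration.
    """
    lines = [
        "Answer the question using only the passages below.",
        "Cite each passage you rely on as [k] right after the claim it supports.",
        "",
    ]
    # Number passages from 1 so they can be cited as [1], [2], ...
    for k, passage in enumerate(passages, start=1):
        lines.append(f"[{k}] {passage.title}: {passage.text}")
    lines.append("")
    lines.append(f"Question: {question}")
    if answer is not None:
        lines.append(f"Ground-truth answer: {answer}")
    if supporting_ids:
        cited = ", ".join(f"[{k}]" for k in supporting_ids)
        lines.append(f"Supporting passages: {cited}")
    lines.append("Write a step-by-step reasoning chain that ends with the answer.")
    return "\n".join(lines)


def make_preference_pair(prompt: str, chosen: str, rejected: str) -> Dict[str, str]:
    """Bundle a preferred and a dispreferred reasoning chain for one prompt."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Under these assumptions, a chain generated with the ground-truth hints and citations could serve as the `chosen` response, and a chain generated without them as the `rejected` response, both paired with the same hint-free prompt that the fine-tuned model would see at inference time.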
