LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA

Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, Juanzi Li

TL;DR

The paper tackles the verification gap in long-context LLMs by enabling fine-grained, sentence-level citations within responses. It introduces LongBench-Cite, a benchmark for automatically evaluating Long-Context Question Answering with Citations (LQAC); CoF (Coarse to Fine), a pipeline used to construct a large-scale LQAC SFT dataset (LongCite-45k); and LongCite-8B and LongCite-9B, models trained on that dataset that produce accurate answers with precise citations in a single pass. Empirical results show these models achieve state-of-the-art citation quality, outperforming advanced proprietary systems such as GPT-4o on citation metrics, while also improving overall answer correctness. The work demonstrates that SFT on LQAC data improves faithfulness and provides a practical pathway toward more trustworthy long-context QA systems.

Abstract

Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering user questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to their potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the LongCite-45k dataset, successfully enabling their generation of accurate responses and fine-grained sentence-level citations in a single output. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o.

Paper Structure (24 sections, 2 equations, 14 figures, 11 tables)

Figures (14)

  • Figure 1: Comparison between chunk-level and sentence-level citations.
  • Figure 2: Overview of our CoF pipeline. The pipeline consists of four steps: (1) generating a long-context QA instance via Self-Instruct; (2) using the answer to retrieve $k$ context chunks and generating chunk-level citations; (3) extracting sentence-level citations for each statement from the cited chunks; (4) filtering out LQAC instances with few citations.
  • Figure 3: Mean and standard deviation of citation F1 with respect to the correctness of LongCite-9B's responses.
  • Figure 4: Prompt for correctness evaluation on LongBench-Chat.
  • Figure 5: Performance of models using different training data on LongBench-Chat.
  • ...and 9 more figures
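
The four CoF steps named in the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses off-the-shelf LLMs and a retriever, whereas the sketch below substitutes simple word overlap for every model call so that it runs on its own. The function name cof_instance, the generate_qa callback, and the parameters chunk_size, min_overlap, and min_cited_fraction are all hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple


def overlap(a: str, b: str) -> int:
    """Count shared lowercase word types (toy stand-in for an LLM or retriever)."""
    tokens = lambda t: set(re.findall(r"[a-z0-9]+", t.lower()))
    return len(tokens(a) & tokens(b))


def cof_instance(
    sentences: List[str],
    generate_qa: Callable[[List[str]], Tuple[str, List[str]]],  # hypothetical callback
    chunk_size: int = 2,              # assumed: sentences per chunk
    k: int = 1,                       # number of chunks to retrieve
    min_overlap: int = 3,             # assumed support threshold
    min_cited_fraction: float = 0.5,  # assumed filter threshold
) -> Optional[Dict]:
    # Step 1: obtain a question and an answer split into statements.
    question, statements = generate_qa(sentences)

    # Group consecutive sentence indices into chunks.
    chunks = [list(range(i, min(i + chunk_size, len(sentences))))
              for i in range(0, len(sentences), chunk_size)]
    chunk_text = [" ".join(sentences[j] for j in c) for c in chunks]

    # Step 2: use the answer to retrieve k chunks, then cite chunks per statement.
    answer = " ".join(statements)
    top = sorted(range(len(chunks)),
                 key=lambda c: overlap(answer, chunk_text[c]),
                 reverse=True)[:k]

    cited = []
    for st in statements:
        chunk_ids = [c for c in top if overlap(st, chunk_text[c]) > 0]
        # Step 3: narrow each cited chunk down to its supporting sentences.
        sent_ids = sorted(j for c in chunk_ids for j in chunks[c]
                          if overlap(st, sentences[j]) >= min_overlap)
        cited.append({"statement": st, "citations": sent_ids})

    # Step 4: drop instances in which too few statements carry citations.
    fraction = sum(1 for s in cited if s["citations"]) / max(len(cited), 1)
    if fraction < min_cited_fraction:
        return None
    return {"question": question, "statements": cited}


context = [
    "The Nile flows through Egypt.",
    "Cairo is the capital of Egypt.",
    "Penguins live in Antarctica.",
    "Bananas are rich in potassium.",
]
supported = cof_instance(
    context,
    lambda s: ("What is the capital of Egypt?", ["Cairo is the capital of Egypt."]),
)
unsupported = cof_instance(
    context,
    lambda s: ("What do dragons do?", ["Dragons breathe fire."]),
)
print(supported)
print(unsupported)
```

In this toy run the supported answer keeps a single sentence-level citation (sentence index 1), while the unsupported answer is discarded by the final filter. A real pipeline would replace overlap with LLM prompts and a proper retriever; only the coarse-to-fine control flow is meant to carry over.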