Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation

Sirui Xia; Xintao Wang; Jiaqing Liang; Yifei Zhang; Weikang Zhou; Jiaji Deng; Fei Yu; Yanghua Xiao

TL;DR

The paper tackles verifiability and credibility in Retrieval-Augmented Generation by introducing ReClaim, a method that interleaves sentence-level references and claims to produce highly granular attributions in long-form answers. It builds specialized training data from WebGLM-QA and ELI5, and uses constrained decoding with a prefix-tree to ensure references precisely align with generated sentences. Two main variants are proposed: ReClaim_Unified for end-to-end one-step generation and ReClaim w/IG, which trains separate ReferModel and ClaimModel and alternates their outputs during inference. Across ASQA, ELI5, and EXPERTQA, ReClaim improves citation quality and verifiability (high CAS and reduced citation length) while maintaining strong fluency, though there are some trade-offs in overall answer accuracy under certain configurations. The work demonstrates that sentence-level attribution via interleaved generation can meaningfully enhance the credibility and verifiability of RAG-based QA systems, with practical impact for systems requiring verifiable sourcing and efficient fact-checking.
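The prefix-tree constraint mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the `PrefixTree` class, the `END` sentinel, the `constrained_greedy_step` helper, and the toy word-level token ids are all assumptions made for illustration. The idea shown is only that, at each decoding step, the model may choose among tokens that continue some sentence from the reference passages, so a generated reference is always a verbatim source sentence.

```python
from typing import Dict, List, Sequence


class PrefixTree:
    """Trie over token-id sequences; each root-to-END path is one allowed reference sentence."""

    END = -1  # hypothetical sentinel meaning "a full sentence ends here"

    def __init__(self, sequences: Sequence[Sequence[int]]) -> None:
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})
            node[self.END] = {}

    def allowed_next(self, prefix: Sequence[int]) -> List[int]:
        """Return the token ids that may follow `prefix` (empty if `prefix` left the tree)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node.keys())


def constrained_greedy_step(scores: Dict[int, float], allowed: List[int]) -> int:
    """Pick the highest-scoring token among the allowed ones, ignoring all other tokens."""
    return max(allowed, key=lambda t: scores.get(t, float("-inf")))


# Toy vocabulary (assumed): the=0, sky=1, is=2, blue=3, grass=4, green=5
reference_sentences = [[0, 1, 2, 3], [0, 4, 2, 5]]
tree = PrefixTree(reference_sentences)

# After "the", only "sky" or "grass" keeps the output inside a source sentence.
options_after_the = tree.allowed_next([0])

# Token 3 ("blue") has the best raw score but is masked out by the constraint.
chosen = constrained_greedy_step({1: 0.1, 4: 0.9, 3: 5.0}, options_after_the)
```

In a real decoder the same mask would be applied to the logits at every step, and reaching `END` would signal that the reference is complete and claim generation can begin.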

Abstract

Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large Language Models (LLMs) in knowledge-intensive tasks. To improve credibility and verifiability in RAG systems, Attributed Text Generation (ATG) has been proposed, which provides citations to retrieved knowledge in LLM-generated responses. Prior methods mainly adopt coarse-grained attributions, with passage-level or paragraph-level references or citations, which fall short in verifiability. This paper proposes ReClaim (Refer & Claim), a fine-grained ATG method that alternates the generation of references and answers step by step. Unlike previous coarse-grained attribution methods, ReClaim provides sentence-level citations in long-form question-answering tasks. Extensive experiments verify the effectiveness of ReClaim across a range of settings, where it achieves a citation accuracy rate of 90%.
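The step-by-step alternation between references and answers can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's inference code: `interleaved_generate`, `toy_refer`, and `toy_claim` are hypothetical names, the two step functions are stand-ins for fine-tuned models, and the choice of which inputs each step sees is an assumption made for the example.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (reference sentence, claim grounded in it)


def interleaved_generate(
    question: str,
    passages: List[str],
    refer_step: Callable[[str, List[str], List[Pair]], Optional[str]],
    claim_step: Callable[[str, str, List[Pair]], str],
    max_steps: int = 8,
) -> List[Pair]:
    """Alternate reference and claim generation until no reference is left or max_steps is hit."""
    pairs: List[Pair] = []
    for _ in range(max_steps):
        reference = refer_step(question, passages, pairs)
        if reference is None:  # the reference step signals that the answer is complete
            break
        claim = claim_step(question, reference, pairs)
        pairs.append((reference, claim))
    return pairs


def toy_refer(question: str, passages: List[str], history: List[Pair]) -> Optional[str]:
    """Stand-in for a reference model: return the next source sentence not yet cited."""
    used = {ref for ref, _ in history}
    for sentence in passages:
        if sentence not in used:
            return sentence
    return None


def toy_claim(question: str, reference: str, history: List[Pair]) -> str:
    """Stand-in for a claim model: restate the reference as an answer sentence."""
    return f"According to the source, {reference[0].lower()}{reference[1:]}"


answer = interleaved_generate(
    "Why is the sky blue?",
    ["Air scatters short wavelengths more strongly.", "Blue light has a short wavelength."],
    toy_refer,
    toy_claim,
)
```

Each element of `answer` pairs one source sentence with the claim it supports, which is the sentence-level granularity the abstract describes.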

Paper Structure (79 sections, 12 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Retrieval-Augmented Generation
Long-form Text Question Answering
Attributed Text Generation
Method
ReClaim: Interleaving Reference and Claim
Training Dataset Construction
Reference Passages Retrieval
Model Answer Generation
Multi-Stage Citation Search
Unified Generation
Interleaving Generation
Reference Generation
Claim Generation
...and 64 more sections

Figures (12)

Figure 1: The task setup for ReClaim. Given a question and reference passages retrieved from a large corpus, the LLM generates a response with fine-grained citations. For detailed examples, see Table \ref{table:testdatacase}.
Figure 2: The generation process of ReClaim w/IG. Based on the given questions and the reference passages retrieved, the LLM alternately generates the reference parts and the claim parts in a step-by-step manner. For these two stages of generation, distinct datasets are constructed to train the base model, which alternately switches between the fine-tuned models and the input context during inference.
Figure 3: Performance on the ELI5 dataset of LLMs trained on the training dataset before and after filtering. The corresponding results for the ASQA and EXPERTQA datasets are presented in Appendix Figure \ref{fig:data_filter_asqa} and Figure \ref{fig:data_filter_expertqa}.
Figure 4: The x-axis represents the accuracy of the LLM's responses, while the y-axis shows the faithfulness score. For the Self-RAG and RS+RL methods, we use the fine-tuned 7B model, whereas for other methods, we employ Llama3-8B-Instruct as the base model.
Figure 5: Reranker score distribution between query and reference passages in the training dataset.
...and 7 more figures
