Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution

Yash Saxena, Raviteja Bommireddy, Ankur Padia, Manas Gaur

TL;DR

This paper compares Generation-Time Citation (G-Cite) and Post-hoc Citation (P-Cite) paradigms for LLM attribution in high-stakes domains. Using eight methods across four datasets and retrieval-augmented setups, it evaluates coverage, correctness, and latency via both automated metrics and human judgments. Key findings show that retrieval quality is the dominant factor for attribution performance across paradigms; P-Cite achieves higher coverage with competitive correctness, while G-Cite yields higher precision but slower, less comprehensive outputs. The authors recommend a retrieval-centric, P-Cite-first approach for high-stakes deployments and reserve G-Cite for precision-critical claim verification, with code and human results released for reproducibility.

Abstract

Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation from zero-shot to advanced retrieval-augmented methods across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our code and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/

Paper Structure

This paper contains 11 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Citation Quality Trends. Average citation correctness, entailed coverage, and latency across categories (Zero-shot, Fine-tuned, RAG, Advanced) for the G-Cite and P-Cite paradigms, averaged over all datasets.
  • Figure 2: Human Evaluation Results. We report Answer Correctness ($\uparrow$) and Citation Hallucination ($\downarrow$); values are averaged over all datasets and methods within each paradigm (G-Cite and P-Cite). P-Cite-based methods tend to provide more correct answers with less hallucination.
  • Figure 3: Coverage and correctness deltas (P-Cite - G-Cite) across datasets. Positive values indicate P-Cite outperforms G-Cite on the metric; negative values indicate otherwise.
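
The sign convention in the Figure 3 caption can be illustrated with a minimal sketch. The function name `paradigm_deltas`, the dataset key `dataset_a`, the metric keys, and all scores below are hypothetical placeholders chosen for illustration; they are not the authors' code or results.

```python
def paradigm_deltas(p_cite, g_cite):
    """Return per-dataset (P-Cite - G-Cite) deltas for each shared metric.

    Both arguments map dataset name -> {metric name -> score}.
    A positive delta means P-Cite scored higher on that metric.
    """
    deltas = {}
    for dataset, p_metrics in p_cite.items():
        g_metrics = g_cite[dataset]
        deltas[dataset] = {
            metric: p_metrics[metric] - g_metrics[metric]
            for metric in p_metrics
            if metric in g_metrics
        }
    return deltas


# Made-up placeholder scores, NOT values reported in the paper.
p_scores = {"dataset_a": {"coverage": 0.75, "correctness": 0.5}}
g_scores = {"dataset_a": {"coverage": 0.5, "correctness": 0.75}}

deltas = paradigm_deltas(p_scores, g_scores)
print(deltas)
```

With these placeholder inputs, the coverage delta is positive (P-Cite higher) and the correctness delta is negative (G-Cite higher), which mirrors how the caption says the plotted values should be read.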