Cited Text Spans for Citation Text Generation
Xiangci Li, Yi-Hui Lee, Jessica Ouyang
TL;DR
This work advances citation text generation by grounding outputs in cited text spans (CTS), the passages of a cited paper that a citation actually draws on, rather than relying solely on abstracts, which reduces the hallucination risk of abstractive methods. It demonstrates that distantly labeled CTS can scale to large datasets while remaining a usable substitute for human annotation, and it introduces practical CTS retrieval (Context, Oracle, Keyword) and generation (RAG-FiD, LED) strategies evaluated on the CORWA dataset. The findings show that CTS-based generation yields higher token overlap with target citations and improved faithfulness compared to abstract-only baselines, though fully automatic CTS retrieval remains challenging and benefits from a human-in-the-loop approach. The study also highlights practical considerations for grounding, including dataset design, retrieval quality, and possible post-processing to mitigate plagiarism, offering a feasible path toward reliable, text-grounded citation generation in real-world use.
Abstract
An automatic citation generation system aims to concisely and accurately describe the relationship between two scientific articles. To do so, such a system must ground its outputs in the content of the cited paper to avoid non-factual hallucinations. Due to the length of scientific documents, existing abstractive approaches have conditioned only on cited paper abstracts. We demonstrate empirically that the abstract is not always the most appropriate input for citation generation and that models trained in this way learn to hallucinate. We propose to condition instead on the cited text span (CTS) as an alternative to the abstract. Because manual CTS annotation is extremely time- and labor-intensive, we experiment with distant labeling of candidate CTS sentences, achieving sufficiently strong performance to substitute for expensive human annotations in model training. We also propose a human-in-the-loop, keyword-based CTS retrieval approach that makes generating citation texts grounded in the full text of cited papers both promising and practical.
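To make the keyword-based retrieval idea concrete, the following is a minimal sketch, not the scoring method used in the paper. It assumes that a human supplies a few keywords and that candidate sentences from the cited paper are ranked by the fraction of keyword tokens they contain. The function names `tokenize` and `retrieve_cts`, the overlap score, and the `top_k` parameter are all hypothetical choices made for illustration.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_cts(
    keywords: List[str], sentences: List[str], top_k: int = 2
) -> List[Tuple[float, str]]:
    """Rank cited-paper sentences by keyword-token coverage.

    Illustrative assumption: the score is the fraction of distinct
    keyword tokens that appear in the sentence. The paper may use a
    different retrieval function.
    """
    query = set()
    for keyword in keywords:
        query.update(tokenize(keyword))

    scored = []
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        score = len(query & tokens) / len(query) if query else 0.0
        scored.append((score, sentence))

    # Python's sort is stable, so tied sentences keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the top_k sentences that match at least one keyword token.
    return [pair for pair in scored[:top_k] if pair[0] > 0]


if __name__ == "__main__":
    cited_paper = [
        "We train a transformer encoder on biomedical abstracts.",
        "Results are reported on three benchmarks.",
        "The encoder is fine-tuned with a contrastive loss.",
    ]
    for score, sentence in retrieve_cts(["contrastive loss", "encoder"], cited_paper):
        print(f"{score:.2f}  {sentence}")
```

In a human-in-the-loop setting, the writer would inspect the returned sentences, adjust the keywords if the candidates miss the intended content, and pass the accepted sentences to the generation model as its grounding input.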
