Context-Enhanced Language Models for Generating Multi-Paper Citations
Avinash Anand, Kritarth Prasad, Ujjwal Goel, Mohit Gupta, Naman Lal, Astha Verma, Rajiv Ratn Shah
TL;DR
The paper tackles multi-sentence citation text generation (CTG), formulated as producing a coherent paragraph that cites multiple papers given a source abstract. It introduces the MCG-S2ORC dataset, derived from the S2ORC corpus for the computer science domain, containing $17{,}210$ samples with 2–3 target papers per source and rich citation metadata. Three LLMs (LLaMA, Alpaca, and Vicuna) are fine-tuned for this task, and performance improves significantly when the prompt also includes knowledge-graph relations extracted with PL-Marker from the source and target abstracts (and introductions/conclusions). In the experiments, Vicuna generally outperforms the baselines, and knowledge-graph-augmented prompts yield substantial gains on the METEOR and ROUGE metrics, underscoring the potential of context-aware prompting for cross-document citation generation. The work contributes a new dataset, a validated prompting strategy, and evidence that knowledge graphs enable more coherent, context-rich multi-paper citations; its main limitation is the token-length constraint of the models, which future work would need to address through scaling.
Abstract
Citation text plays a pivotal role in elucidating the connection between scientific documents, demanding an in-depth comprehension of the cited paper. Constructing citations is often time-consuming, requiring researchers to delve into extensive literature and grapple with articulating relevant content. To address this challenge, the field of citation text generation (CTG) has emerged. However, while earlier methods have primarily centered on creating single-sentence citations, practical scenarios frequently necessitate citing multiple papers within a single paragraph. To bridge this gap, we propose a method that leverages Large Language Models (LLMs) to generate multi-citation sentences. Our approach takes a single source paper and a collection of target papers as input and produces a coherent paragraph containing multi-sentence citation text. Furthermore, we introduce a curated dataset named MCG-S2ORC, composed of English-language academic research papers in Computer Science, showcasing multiple citation instances. In our experiments, we evaluate three LLMs (LLaMA, Alpaca, and Vicuna) to ascertain the most effective model for this endeavor. Additionally, we demonstrate improved performance by integrating knowledge graphs from target papers into the prompts for generating citation text. This research underscores the potential of harnessing LLMs for citation generation, opening a compelling avenue for exploring the intricate connections between scientific documents.
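
To make the prompting idea concrete, the sketch below shows one way a prompt could be assembled from a source abstract, several target abstracts, and knowledge-graph relations. It is only an illustration: the `Paper` class, the `build_prompt` function, the (head, relation, tail) triple format, and the prompt wording are all assumptions, not the template used in the paper, and the relation extraction step (PL-Marker) is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical triple format: (head entity, relation label, tail entity).
Triple = Tuple[str, str, str]


@dataclass
class Paper:
    """Minimal stand-in for a paper: its abstract plus pre-extracted relations."""
    abstract: str
    relations: List[Triple] = field(default_factory=list)


def format_relations(relations: List[Triple]) -> str:
    """Serialize triples into a compact, readable string for the prompt."""
    return "; ".join(f"({h}, {r}, {t})" for h, r, t in relations)


def build_prompt(source: Paper, targets: List[Paper]) -> str:
    """Assemble an illustrative multi-citation prompt (wording is an assumption)."""
    lines = [
        "Write one coherent paragraph in which the source paper "
        "cites every target paper.",
        f"Source abstract: {source.abstract}",
    ]
    if source.relations:
        lines.append(f"Source relations: {format_relations(source.relations)}")
    # Number the targets so the generated text can refer to them as [1], [2], ...
    for i, target in enumerate(targets, start=1):
        lines.append(f"Target [{i}] abstract: {target.abstract}")
        if target.relations:
            lines.append(
                f"Target [{i}] relations: {format_relations(target.relations)}"
            )
    lines.append("Citation paragraph:")
    return "\n".join(lines)


# Toy inputs, invented purely for illustration.
source = Paper(
    abstract="We fine-tune language models to write citation paragraphs.",
    relations=[("language models", "USED-FOR", "citation generation")],
)
targets = [
    Paper(abstract="A corpus of open-access research papers."),
    Paper(
        abstract="A span-based relation extraction method.",
        relations=[("span markers", "USED-FOR", "relation extraction")],
    ),
]
prompt = build_prompt(source, targets)
```

Appending the relations as plain text keeps the prompt format model-agnostic, but it also lengthens the input, which is where the token-length constraint mentioned in the TL;DR becomes relevant.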
