Leveraging Large Language Models for Comparative Literature Summarization with Reflective Incremental Mechanisms
Fernando Gabriela Garcia, Spencer Burns, Harrison Fuller
TL;DR
The paper tackles the challenge of generating automatic literature reviews that meaningfully compare multiple papers under long-context conditions. It introduces ChatCite, a generative, memory-augmented LLM framework trained through a multi-stage process (pre-training, then comparative fine-tuning) to produce integrated comparative summaries, with a long-context memory mechanism for handling large document sets. To evaluate cross-paper synthesis, the paper also introduces a new dataset, CompLit-LongContext (and CiteComp-1000 in related descriptions), and a new metric, the Comparative Quality Score (G-Score). Empirical results show that ChatCite outperforms GPT-4, BART, T5, and CoT on ROUGE metrics and G-Score, and human judgments likewise rate its summaries higher on coherence, insight, and fluency. The work targets scalable, high-quality literature reviews and suggests extending the approach to broader domains and improving its interpretability.
Abstract
We introduce ChatCite, a novel method that leverages large language models (LLMs) to generate comparative literature summaries. Summarizing research papers with a focus on key comparisons between studies is an essential task in academic research. Existing summarization models, while effective at generating concise summaries, fail to provide deep comparative insights. ChatCite addresses this limitation with a multi-step reasoning mechanism that extracts critical elements from papers, incrementally builds a comparative summary, and refines the output through a reflective memory process. We evaluate ChatCite on a custom dataset, CompLit-LongContext, consisting of 1000 research papers with annotated comparative summaries. Experimental results show that ChatCite outperforms several baseline methods, including GPT-4, BART, T5, and CoT, on automatic evaluation metrics such as ROUGE and the newly proposed G-Score. Human evaluation further confirms that ChatCite generates more coherent, insightful, and fluent summaries than these baselines. Our method advances automatic literature review generation, offering researchers a tool for efficiently comparing and synthesizing scientific research.
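The abstract names three steps: extracting critical elements, incrementally building a comparative summary, and refining it through a reflective memory process. The following is a minimal sketch of how such a loop might be organized, not the authors' implementation. The function names, prompt wording, the `reflect_rounds` parameter, and the use of a plain list as memory are all assumptions made for illustration; `llm` stands for any text-in, text-out callable.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def extract_key_elements(paper: str, llm: LLM) -> str:
    """Step 1 (assumed prompt): pull out the elements worth comparing."""
    return llm(
        "Extract the key elements (problem, method, findings) "
        "of this paper:\n" + paper
    )


def comparative_summary(papers: List[str], llm: LLM, reflect_rounds: int = 1) -> str:
    """Build a comparative summary one paper at a time, with reflection."""
    memory: List[str] = []  # assumed memory: key elements of every paper seen so far
    summary = ""
    for paper in papers:
        elements = extract_key_elements(paper, llm)
        memory.append(elements)

        # Step 2: fold the new paper into the running comparative summary.
        summary = llm(
            "Update the comparative summary with the new paper.\n"
            "Current summary:\n" + summary + "\n"
            "New paper elements:\n" + elements
        )

        # Step 3: reflect, checking the draft against everything in memory.
        for _ in range(reflect_rounds):
            summary = llm(
                "Revise the summary so that it stays consistent with all "
                "papers in memory and states the comparisons explicitly.\n"
                "Memory:\n" + "\n".join(memory) + "\n"
                "Summary:\n" + summary
            )
    return summary
```

With `reflect_rounds=1`, each paper costs three model calls (extract, update, reflect), so the number of calls grows linearly with the number of papers. The memory is what grows with the document set, which is where a long-context mechanism would be needed.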
