LitLLMs, LLMs for Literature Review: Are we there yet?
Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H. Laradji, Krishnamurthy DJ Dvijotham, Jason Stanley, Laurent Charlin, Christopher Pal
TL;DR
This work investigates the zero-shot capabilities of recent LLMs for literature-review generation by decomposing the task into retrieval and writing. It introduces a two-step retrieval pipeline that combines LLM-generated keywords with embedding-based search, followed by attribution-enabled re-ranking, and a plan-based generation pipeline that conditions on the retrieved context to produce the related-work section. The authors construct rolling evaluation datasets (RollingEval-Aug/Dec) from recent arXiv papers to avoid test-set contamination. They demonstrate that plan-based prompting and retrieval-augmented generation substantially improve quality and reduce hallucinations, while attribution verification makes the ranking more reliable. The results are promising and suggest that LLMs are close to being ready to assist researchers, but the study also highlights two open challenges: achieving complete end-to-end retrieval coverage and fully eliminating fabrications in the generated text.
Abstract
Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially given the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: (1) retrieving related works given a query abstract, and (2) writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods, while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes the steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this direction. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Our project page, including a demonstration system and toolkit, can be accessed here: https://litllm.github.io.
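
The decomposition described in the abstract (keyword extraction, knowledge-base search, re-ranking, then plan-first writing) can be sketched as a few composable functions. This is a minimal illustration under stated assumptions, not the authors' implementation: every name below (`retrieve_candidates`, `rerank`, `write_review`, and the `toy_*` helpers) is hypothetical, the LLM and the external knowledge base are replaced by toy stand-ins so the code runs offline, and the attribution part of re-ranking is omitted.

```python
from typing import Callable, Dict, List

Paper = Dict[str, str]  # hypothetical record with "title" and "abstract" keys


def retrieve_candidates(
    query_abstract: str,
    extract_keywords: Callable[[str], List[str]],
    search: Callable[[str], List[Paper]],
) -> List[Paper]:
    """Retrieval step 1: search once per keyword, de-duplicating by title."""
    seen, candidates = set(), []
    for keyword in extract_keywords(query_abstract):
        for paper in search(keyword):
            if paper["title"] not in seen:
                seen.add(paper["title"])
                candidates.append(paper)
    return candidates


def rerank(
    query_abstract: str,
    candidates: List[Paper],
    score: Callable[[str, Paper], float],
) -> List[Paper]:
    """Retrieval step 2: sort candidates by a relevance score, highest first."""
    return sorted(candidates, key=lambda p: score(query_abstract, p), reverse=True)


def write_review(query_abstract, papers, make_plan, execute_plan) -> str:
    """Generation: draft a plan first, then write the review from that plan."""
    plan = make_plan(query_abstract, papers)
    return execute_plan(plan, query_abstract, papers)


# --- Toy stand-ins (assumptions) so the sketch runs without an LLM or API ---
CORPUS = [
    {"title": "Dense Retrieval", "abstract": "embedding retrieval for search"},
    {"title": "Plan Prompting", "abstract": "plan based generation with llm"},
    {"title": "Protein Folding", "abstract": "structure prediction for proteins"},
]


def toy_keywords(abstract: str) -> List[str]:
    # Stands in for an LLM keyword-extraction prompt.
    return [w for w in ("retrieval", "generation") if w in abstract.lower()]


def toy_search(keyword: str) -> List[Paper]:
    # Stands in for a query against an external knowledge base.
    return [p for p in CORPUS if keyword in p["abstract"]]


def toy_score(query: str, paper: Paper) -> float:
    # Stands in for an LLM relevance judgment: count of shared words.
    return len(set(query.lower().split()) & set(paper["abstract"].split()))


def toy_plan(query: str, papers: List[Paper]) -> List[str]:
    # Stands in for an LLM-written outline: one step per paper to cite.
    return [f"Cite [{i + 1}] {p['title']}" for i, p in enumerate(papers)]


def toy_execute(plan: List[str], query: str, papers: List[Paper]) -> str:
    # Stands in for the LLM call that turns the plan into prose.
    return " ".join(step + "." for step in plan)


query = "We study retrieval and plan based generation with LLM prompting"
candidates = retrieve_candidates(query, toy_keywords, toy_search)
ranked = rerank(query, candidates, toy_score)
review = write_review(query, ranked, toy_plan, toy_execute)
print(review)
```

Passing each stage in as a callable keeps the retrieval and writing components independently swappable, which mirrors the way the abstract evaluates them separately. In a real system the `toy_*` functions would be replaced by LLM prompts and a search backend of the reader's choosing.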
