LitLLMs, LLMs for Literature Review: Are we there yet?

Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H. Laradji, Krishnamurthy DJ Dvijotham, Jason Stanley, Laurent Charlin, Christopher Pal

TL;DR

This work investigates zero-shot capabilities of recent LLMs for literature-review generation by decomposing the task into retrieval and writing. It introduces a two-step retrieval pipeline that uses LLM-generated keywords and embedding-based search, paired with attribution-enabled re-ranking, and a plan-based generation pipeline that conditions on retrieved context to produce the related-work section. The authors construct rolling evaluation datasets (RollingEval-Aug/Dec) from recent arXiv papers to avoid test-set contamination and demonstrate that plan-based prompting and retrieval-augmented generation substantially improve quality and reduce hallucinations, while attribution verification enhances ranking reliability. Although results are promising and show close-to-ready potential for assisting researchers, the study also highlights ongoing challenges in achieving complete end-to-end retrieval coverage and fully eliminating fabrications in generated text.

Abstract

Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially due to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: 1. Retrieving related works given a query abstract, and 2. Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods, while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Our project page including a demonstration system and toolkit can be accessed here: https://litllm.github.io.
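The decomposition described in the abstract (keyword extraction, external search, LLM re-ranking, then plan-then-write generation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the prompt wording, and the `llm` and `search` callables are all hypothetical stand-ins for a real model client and knowledge-base API, and the attribution and embedding-based search steps are omitted for brevity.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   llm(prompt) -> text completion
#   search(keyword) -> list of candidate paper titles/abstracts
LLM = Callable[[str], str]
Search = Callable[[str], List[str]]


def retrieve(abstract: str, llm: LLM, search: Search, top_k: int = 5) -> List[str]:
    """Two-step retrieval followed by prompting-based re-ranking."""
    # Step 1: ask the LLM for comma-separated keywords from the abstract.
    raw = llm(f"Extract search keywords (comma-separated) from this abstract:\n{abstract}")
    keywords = [k.strip() for k in raw.split(",") if k.strip()]

    # Step 2: query the external knowledge base per keyword,
    # de-duplicating while preserving first-seen order.
    candidates: List[str] = []
    for kw in keywords:
        for paper in search(kw):
            if paper not in candidates:
                candidates.append(paper)

    # Re-ranking: ask the LLM for an ordering expressed as candidate indices.
    listing = "\n".join(f"[{i}] {p}" for i, p in enumerate(candidates))
    reply = llm(
        "Rank these papers by relevance to the abstract; "
        "reply with indices, comma-separated.\n"
        f"Abstract: {abstract}\n{listing}"
    )
    order: List[int] = []
    for tok in reply.split(","):
        tok = tok.strip()
        # Ignore malformed, out-of-range, or repeated indices.
        if tok.isdigit() and int(tok) < len(candidates) and int(tok) not in order:
            order.append(int(tok))
    # Candidates the LLM did not mention keep their original relative order.
    order += [i for i in range(len(candidates)) if i not in order]
    return [candidates[i] for i in order[:top_k]]


def write_review(abstract: str, papers: List[str], llm: LLM) -> str:
    """Plan-based generation: outline a plan first, then write from it."""
    refs = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(papers))
    plan = llm(
        "Outline a sentence plan for a related-work section.\n"
        f"Abstract: {abstract}\nReferences:\n{refs}"
    )
    return llm(
        "Follow this plan to write the related-work section.\n"
        f"Plan: {plan}\nAbstract: {abstract}\nReferences:\n{refs}"
    )
```

Passing `llm` and `search` as plain callables keeps the sketch independent of any particular model provider or search service, and makes each stage easy to stub out when testing.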

Paper Structure

This paper contains 25 sections, 3 equations, 16 figures, 14 tables, 1 algorithm.

Figures (16)

  • Figure 1: A schematic diagram of our framework, where: 1) Relevant prior work is retrieved using keyword and embedding-based search. 2) LLMs re-rank results to find the most relevant prior work. 3) Based on these papers and the user abstract or idea summary, an LLM generates a literature review, 4) optionally controlled by a sentence plan.
  • Figure 2: Effect of re-ranking strategies on the RollingEval-Dec dataset. We use the entire dataset ($n=500$) and set $k=100$ for these experiments. We evaluate the Precision and Normalized Recall of the re-ranked results, with the embedding-based ranker (SPECTER2) outperforming GPT-4-based re-ranking. We find a similar pattern for the RollingEval-Aug dataset, as shown in the Appendix (Figure \ref{appendix-figure:retrieval-pr-curves}). Note: The first part of each legend entry denotes the search database used for retrieval, and the second denotes the re-ranking mechanism.
  • Figure 3: The effect of removing the referenced content verification step in our debate ranking strategy. We plot precision and normalized recall for two variants of the debate ranking strategy. For this ablation study, we select a smaller subset of $n=100$ query abstracts, set $k=40$, and repeat the experiment for three random seeds. We plot the mean and show the standard deviation as the shaded region. We find that the precision and normalized recall drop slightly upon removing the verification step. This difference is significant (as determined by a t-test), indicating that the verification step is crucial for the success of the debate ranking strategy.
  • Figure 4: Pipeline of generation task where the model needs to generate the related work of the query paper conditioned on reference papers. Our method employs an optional plan --- shown by the dotted purple box, either generated by the model or appended to the prompt.
  • Figure 5: Human evaluation study where annotators ranked the generations of 0-shot models against their sentence-plan-based counterparts. On the Y-axis, we show counts from an overall sample size of 58 annotations for Llama 2-Chat and 54 for GPT-4 (ranking ties are allowed). With plan-based prompting, the share of cases with hallucinations falls from 58.6% to 32.7% for Llama 2-Chat and from 29.6% to 11.6% for GPT-4.
  • ...and 11 more figures