Patience is all you need! An agentic system for performing scientific literature review
David Brett, Anniek Myatt
TL;DR
The paper presents an agentic, LLM-enabled system for performing scientific literature reviews by combining sparse retrieval with LLM-based query expansion, long-context processing, BM25L re-ranking, and a multi-stage information extraction pipeline. It adds a verification and diversification layer (CoVe) to plan and check key statements, and introduces literature expansion to broaden coverage beyond initial results. Evaluations on LitQA2 and PubMedQA show competitive retrieval performance and credible attribution, with the full workflow achieving substantial precision and reasonable recall, while end-to-end review generation can be completed in roughly 10–30 minutes. The work demonstrates that sparse retrieval, augmented by targeted expansion and verification, can yield well-grounded reviews without dense retrieval infrastructure, offering practical impact for faster, sourced literature synthesis while highlighting areas for benchmark and scalability improvements.
Abstract
Large language models (LLMs) are increasingly used to support question answering across numerous disciplines. On their own, the models have shown promise in answering basic questions, but they fail quickly where expert domain knowledge is required or the question is nuanced. Scientific research often involves searching for relevant literature, distilling pertinent information from that literature, and analysing how the findings support or contradict one another. This information is often contained in the full text of research articles rather than just in the abstracts, and statements within these articles frequently require the wider article context to be fully understood. We have built an LLM-based system that performs this search and distillation of information from scientific literature, and we evaluate our keyword-based search and information distillation system against a set of biology-related questions from previously released literature benchmarks. We demonstrate that sparse retrieval methods achieve results close to the state of the art without the need for dense retrieval and its associated infrastructure and complexity overhead. We also show how to increase the coverage of relevant documents for literature review generation.
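To make the sparse re-ranking step mentioned in the TL;DR concrete, the following is a minimal sketch of BM25L scoring over pre-tokenised documents. It is an illustration under stated assumptions rather than the system's actual implementation: the function names `bm25l_scores` and `rerank` are hypothetical, the parameter defaults (`k1=1.5`, `b=0.75`, `delta=0.5`) are commonly used values and are not taken from the paper, and tokenisation is assumed to have been done beforehand.

```python
import math
from collections import Counter


def bm25l_scores(query_terms, docs, k1=1.5, b=0.75, delta=0.5):
    """Score each tokenised document against the query with BM25L.

    query_terms: list of query tokens (hypothetical input format).
    docs: list of documents, each a list of tokens.
    k1, b, delta: assumed defaults, not values reported by the paper.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs + 1) / (df[term] + 0.5))
            # Length-normalised term frequency.
            ctd = tf[term] / (1 - b + b * len(d) / avgdl)
            # The delta shift reduces the penalty on long documents.
            score += idf * (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)
        scores.append(score)
    return scores


def rerank(query_terms, docs):
    """Return document indices ordered from highest to lowest BM25L score."""
    scores = bm25l_scores(query_terms, docs)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In a pipeline like the one described, the query terms would plausibly include keywords produced by LLM-based query expansion, and the ranked indices would select which passages are passed on for information extraction. Because scoring needs only token counts, no embedding model or vector index is required.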
