Leveraging LLMs for Semi-Automatic Corpus Filtration in Systematic Literature Reviews

Lucas Joos, Daniel A. Keim, Maximilian T. Fischer

TL;DR

This paper presents LLMSurver, an open-source, human-in-the-loop pipeline for semi-automatic corpus filtration in systematic literature reviews. By ensembling multiple LLMs and applying a consensus scheme, the approach substantially reduces manual screening time while preserving high recall, demonstrated on a large 8.3k-paper corpus retrieved for a real SLR. Across mid-2024 and fall-2025 model cohorts, open-source models increasingly matched or exceeded commercial performance, with careful prompt design and interactive supervision enabling robust, transparent decision-making. The work advances responsible AI-assisted research workflows, emphasizing privacy, reproducibility, and practical integration into academic practice.

Abstract

The creation of systematic literature reviews (SLR) is critical for analyzing the landscape of a research field and guiding future research directions. However, retrieving and filtering the literature corpus for an SLR is highly time-consuming and requires extensive manual effort, as keyword-based searches in digital libraries often return numerous irrelevant publications. In this work, we propose a pipeline leveraging multiple large language models (LLMs), classifying papers based on descriptive prompts and deciding jointly using a consensus scheme. The entire process is human-supervised and interactively controlled via our open-source visual analytics web interface, LLMSurver, which enables real-time inspection and modification of model outputs. We evaluate our approach using ground-truth data from a recent SLR comprising over 8,000 candidate papers, benchmarking both open and commercial state-of-the-art LLMs from mid-2024 and fall 2025. Results demonstrate that our pipeline significantly reduces manual effort while achieving lower error rates than single human annotators. Furthermore, modern open-source models prove sufficient for this task, making the method accessible and cost-effective. Overall, our work demonstrates how responsible human-AI collaboration can accelerate and enhance systematic literature reviews within academic workflows.

Paper Structure

This paper contains 20 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Schematic overview of leveraging LLM-based agents for structured literature filtration in a systematic literature review. Keyword-based search in online libraries generates a large set of candidate papers, which are preprocessed and then classified by multiple LLMs based on title and abstract using a customized prompt. A consensus voting scheme determines the final inclusion or rejection. The user retains control by inspecting and adapting all steps, including the initial database selection and search, preprocessing, LLMs, prompts, and the consensus scheme.
  • Figure 2: An exemplary prompt template for individual LLM agents.
  • Figure 3: Overview of the user interface of our open-source application LLMSurver, which implements the proposed pipeline. The interface includes a paper table, a prompt editor, controls for LLM selection and classification execution, a consensus scheme and statistics view, and visual plots for comparative analysis. The application can be used freely at https://llmsurver.dbvis.de.
  • Figure 4: Overview of pairwise comparisons showing incorrect decisions by the agents for the Mid-2024 models. The ground-truth classification of each paper is shown on the top row (left side: included, right side: discarded), followed by the decisions of individual agents, and finally by the two consensus methods (all models and top-3 models). Incorrect exclusions (FN) appear as red discarded lines on the left, whereas incorrect inclusions (FP) appear as green included lines on the larger right side. Note the single incorrectly discarded paper (FN) for the consensus methods at the bottom left, which reduces recall.
  • Figure 5: Number of papers (gray background) that were incorrectly (inc.) classified as included (left) or excluded (right) by the Mid-2024 LLM agents, grouped by the number of agents involved in each incorrect decision. The individual bars indicate how often a specific agent contributed to a wrong decision. On the far right, only one paper is misclassified by all agents (and therefore lost permanently), illustrating that N-Consensus voting is advantageous when prioritizing recall.
  • ...and 4 more figures
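
The consensus voting described in the captions for Figures 1 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_discard_votes` threshold, and the toy votes are all assumptions made for illustration. The sketch assumes a paper is discarded only when at least a chosen number of agents vote to discard it, and it compares a unanimous rule with a simple majority rule in terms of recall.

```python
from typing import Dict, List


def n_consensus(votes: List[bool], min_discard_votes: int) -> bool:
    """Return True (include) unless at least `min_discard_votes` agents voted to discard.

    Hypothetical rule for illustration; True in `votes` means "include".
    """
    discard_votes = sum(1 for include in votes if not include)
    return discard_votes < min_discard_votes


def recall(decisions: Dict[str, bool], ground_truth: Dict[str, bool]) -> float:
    """Fraction of truly relevant papers that the consensus kept."""
    relevant = [pid for pid, included in ground_truth.items() if included]
    if not relevant:
        return 1.0
    kept = sum(1 for pid in relevant if decisions[pid])
    return kept / len(relevant)


# Toy data (invented): three agents vote on four papers.
agent_votes = {
    "paper_a": [True, True, False],
    "paper_b": [False, False, False],
    "paper_c": [True, False, False],
    "paper_d": [False, False, True],
}
ground_truth = {
    "paper_a": True,
    "paper_b": False,
    "paper_c": True,
    "paper_d": False,
}

# Unanimous rule: discard only if every agent votes to discard.
unanimous = {
    pid: n_consensus(votes, min_discard_votes=len(votes))
    for pid, votes in agent_votes.items()
}

# Majority rule: discard as soon as two of the three agents vote to discard.
majority = {
    pid: n_consensus(votes, min_discard_votes=2)
    for pid, votes in agent_votes.items()
}

print("unanimous recall:", recall(unanimous, ground_truth))
print("majority recall:", recall(majority, ground_truth))
```

In this toy example the unanimous rule keeps both relevant papers at the cost of one extra false inclusion (paper_d), while the majority rule loses paper_c. That mirrors the trade-off the Figure 5 caption points to: requiring more agents to agree before discarding favors recall and leaves more papers for manual review.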