Zero-shot Generative Large Language Models for Systematic Review Screening Automation

Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman, Guido Zuccon

TL;DR

This work tackles the screening bottleneck in systematic reviews by evaluating eight open, zero-shot large language models under uncalibrated and calibrated decision rules, plus a CombSUM ensemble, on public datasets (CLEF TAR and Seed Collection). Screening decisions rely on next-token likelihoods: $P(\texttt{yes}|d,t)$ and $P(\texttt{no}|d,t)$ guide inclusion. In the calibrated setting, the score $S(d,t)=P(\texttt{yes}|d,t)-P(\texttt{no}|d,t)$ is normalized to $S_{\text{norm}}(d,t)$ and compared against a threshold $\theta$ chosen to meet a recall target. Results show that instruction-based fine-tuning improves performance, calibration reliably achieves target recall, and calibrated ensembles can outperform fine-tuned baselines such as Bio-SIEVE across datasets. The findings suggest a practical, cost-effective path to integrating zero-shot LLMs into systematic-review workflows, enabling substantial reductions in manual screening time without extensive task-specific labeling. Overall, the study demonstrates that open-source zero-shot LLMs, when properly calibrated and combined in an ensemble, can deliver recall-focused performance suitable for real-world evidence synthesis.
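
The two decision rules summarized above can be illustrated with a minimal Python sketch. Several details are assumptions rather than the paper's stated procedure: the summary names $S_{\text{norm}}(d,t)$ but not the normalization, so min-max scaling per topic is assumed here, and the threshold $\theta$ is assumed to be picked from labelled calibration data as the largest value that still reaches the target recall. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def uncalibrated_decision(p_yes: float, p_no: float) -> bool:
    """Uncalibrated rule: include when 'yes' is more likely than 'no'."""
    return p_yes > p_no


def score(p_yes: float, p_no: float) -> float:
    """S(d, t) = P(yes | d, t) - P(no | d, t)."""
    return p_yes - p_no


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Assumed normalization: rescale one topic's scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def threshold_for_recall(
    norm_scores: Sequence[float],
    labels: Sequence[int],
    target_recall: float,
) -> float:
    """Assumed calibration: largest theta such that including every
    document with S_norm >= theta reaches the target recall on
    labelled calibration data."""
    relevant = sorted(
        (s for s, y in zip(norm_scores, labels) if y), reverse=True
    )
    if not relevant:
        raise ValueError("calibration data contains no relevant documents")
    # Number of relevant documents that must be included; the small
    # epsilon guards against floating-point overshoot in the product.
    needed = max(1, math.ceil(target_recall * len(relevant) - 1e-9))
    return relevant[needed - 1]


def calibrated_decision(s_norm: float, theta: float) -> bool:
    """Calibrated rule: include when the normalized score reaches theta."""
    return s_norm >= theta
```

With a relevant document scored $P(\texttt{yes})=0.4$, $P(\texttt{no})=0.5$, the uncalibrated rule excludes it, whereas a threshold calibrated for full recall on data containing that document would sit at or below its normalized score and include it. This is the sense in which calibration trades some precision for a recall guarantee.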

Abstract

Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation using five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.
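
The ensemble of zero-shot models is described in the TL;DR as a CombSUM ensemble. CombSUM is a standard score-fusion method that sums each document's scores across systems. The sketch below shows the idea. Whether the paper normalizes scores before fusing, and which scores it fuses, is not stated in this summary, so the per-model min-max normalization and the function names are assumptions.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Assumed pre-fusion step: rescale one model's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """CombSUM: sum each document's (normalized) score over all models.

    A document missing from a run simply contributes nothing for it.
    """
    fused: Dict[str, float] = {}
    for run in runs:
        for doc, s in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return fused


# Hypothetical scores S(d, t) from two models for three documents.
model_a = {"d1": 0.8, "d2": 0.2, "d3": -0.4}
model_b = {"d1": 0.3, "d2": 0.5, "d3": -0.3}

fused = comb_sum([model_a, model_b])
ranking = sorted(fused, key=fused.get, reverse=True)
```

The fused scores can then be thresholded in the same way as a single model's scores, which is how a calibrated ensemble would fit into the framework described above.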

Paper Structure (12 sections, 6 equations, 1 figure, 4 tables)

This paper contains 12 sections, 6 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Our framework for automatic document screening using generative LLMs. $P(\texttt{yes}|d,t)$ and $P(\texttt{no}|d,t)$ are the likelihoods of the yes and no tokens in the next-token probability list, and $\theta$ is the decision boundary (threshold) used in the calibrated setting.
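
To make the caption's "next-token probability list" concrete, the sketch below turns a model's next-token logits into probabilities with a softmax and reads off the yes and no entries. It is only an illustration under stated assumptions: the four-token vocabulary and its logits are invented, the token strings `"yes"` and `"no"` are hypothetical (real tokenizers may split or case them differently), and no particular model or library API is implied.

```python
import math
from typing import Dict, Tuple


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert logits to probabilities over the (toy) vocabulary."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def yes_no_probabilities(
    next_token_logits: Dict[str, float],
    yes_token: str = "yes",  # hypothetical token string
    no_token: str = "no",    # hypothetical token string
) -> Tuple[float, float]:
    """Return (P(yes | d, t), P(no | d, t)) from next-token logits."""
    probs = softmax(next_token_logits)
    return probs.get(yes_token, 0.0), probs.get(no_token, 0.0)


# Invented logits standing in for a model's output after a screening prompt.
toy_logits = {"yes": 2.0, "no": 1.0, "maybe": 0.0, "the": -1.0}
p_yes, p_no = yes_no_probabilities(toy_logits)
```

Because the softmax runs over the whole vocabulary, the two probabilities generally do not sum to one, which is one reason a difference score and a tunable threshold can be more useful than a fixed cut-off.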