Zero-shot Generative Large Language Models for Systematic Review Screening Automation
Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman, Guido Zuccon
TL;DR
This work tackles the screening bottleneck in systematic reviews by evaluating eight open, zero-shot large language models under uncalibrated and calibrated decision rules, plus a CombSUM ensemble, on public datasets (CLEF TAR and Seed Collection). Screening decisions rely on next-token likelihoods: the uncalibrated rule compares $P(\texttt{yes}|d,t)$ with $P(\texttt{no}|d,t)$ to decide inclusion, while the calibrated rule computes the score $S(d,t)=P(\texttt{yes}|d,t)-P(\texttt{no}|d,t)$ and compares its normalised form $S_{\text{norm}}(d,t)$ against a threshold $\theta$ set to meet a recall target. Results show that instruction-based fine-tuning improves performance, that calibration reliably achieves the target recall, and that calibrated ensembles can outperform fine-tuned baselines such as Bio-SIEVE across datasets. The findings suggest a practical, cost-effective path to integrating zero-shot LLMs into systematic-review workflows, enabling substantial reductions in manual screening time without extensive task-specific labeling. Overall, the study demonstrates that open-source zero-shot LLMs, when properly calibrated and combined in an ensemble, can deliver recall-focused performance suitable for real-world evidence synthesis.
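The two decision rules summarised above can be illustrated with a minimal sketch. It rests on assumptions that this summary does not confirm: $S_{\text{norm}}$ is taken to be a min-max normalisation of $S$ over the candidate documents, and $\theta$ is taken to be the largest threshold that still reaches the target recall on a small labelled calibration set. All function names are hypothetical, and the probabilities are made-up illustrative values rather than model outputs.

```python
import math


def decision_score(p_yes: float, p_no: float) -> float:
    """S(d,t) = P(yes|d,t) - P(no|d,t)."""
    return p_yes - p_no


def uncalibrated_include(p_yes: float, p_no: float) -> bool:
    """Uncalibrated rule: include when 'yes' is the likelier next token."""
    return p_yes > p_no


def min_max_normalise(scores: list[float]) -> list[float]:
    """Assumed form of S_norm: rescale scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def calibrate_threshold(norm_scores: list[float], labels: list[bool],
                        target_recall: float) -> float:
    """Assumed calibration: the largest theta whose recall on the
    labelled calibration documents is at least target_recall."""
    relevant = sorted((s for s, y in zip(norm_scores, labels) if y),
                      reverse=True)
    if not relevant:
        raise ValueError("calibration set contains no relevant documents")
    needed = max(1, math.ceil(target_recall * len(relevant)))
    return relevant[needed - 1]


def calibrated_include(norm_score: float, theta: float) -> bool:
    """Calibrated rule: include when S_norm(d,t) reaches the threshold."""
    return norm_score >= theta


# Illustrative (P(yes), P(no)) pairs and relevance labels for four documents.
pairs = [(0.9, 0.1), (0.6, 0.4), (0.3, 0.7), (0.2, 0.8)]
labels = [True, False, True, False]

norm = min_max_normalise([decision_score(y, n) for y, n in pairs])
theta = calibrate_threshold(norm, labels, target_recall=1.0)
decisions = [calibrated_include(s, theta) for s in norm]
```

In this toy example the uncalibrated rule would exclude the third document, which is relevant, because its $P(\texttt{no})$ exceeds its $P(\texttt{yes})$. The calibrated rule lowers $\theta$ until that document is included, at the cost of also including one irrelevant document.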
Abstract
Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation using five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.
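The ensemble of zero-shot models can be sketched with CombSUM, the standard fusion method that sums each document's normalised scores across systems. This is a minimal illustration, not the authors' implementation: the min-max normalisation per model, the function names, and the scores are all assumptions made for the example.

```python
def min_max(run: dict[str, float]) -> dict[str, float]:
    """Rescale one model's scores to [0, 1] so models are comparable."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 0.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}


def comb_sum(runs: list[dict[str, float]]) -> dict[str, float]:
    """CombSUM: sum each document's normalised scores over all models."""
    fused: dict[str, float] = {}
    for run in runs:
        for doc, score in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return fused


# Hypothetical per-model scores S(d,t) for three candidate documents.
model_a = {"d1": 0.8, "d2": 0.2, "d3": -0.6}
model_b = {"d1": 0.1, "d2": 0.5, "d3": -0.3}

fused = comb_sum([model_a, model_b])
ranking = sorted(fused, key=fused.get, reverse=True)
```

The fused scores can then be thresholded in the same way as a single model's scores, so a recall-targeted threshold applies to the ensemble as well.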
