Can large language models replace humans in the systematic review process? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

Qusai Khraisha; Sophie Put; Johanna Kappenberg; Azza Warraitch; Kristin Hadfield

TL;DR

This pre-registered study evaluates how well GPT-4 performs three systematic-review tasks on its own (title/abstract screening, full-text screening, and data extraction) across peer-reviewed, grey, and non-English literature, using a human-out-of-the-loop workflow. The dataset comprises 300 titles/abstracts, 150 full texts, and 30 extracts. Outputs are scored from the confusion-matrix counts $TP$, $TN$, $FP$, and $FN$ via sensitivity ($S_e$), specificity ($S_p$), and accuracy ($A$), alongside the agreement measures $\kappa$, $\mathrm{PABAK}$, and weighted $\kappa$, while accounting for a prevalence of around $3\%$. Raw results are strongly influenced by chance agreement and data imbalance. After adjusting for both, data extraction reaches moderate performance and screening ranges from none to moderate, with one exception: full-text screening with highly reliable prompts yields near-perfect agreement ($\kappa \approx .91$, weighted $\kappa \approx .97$). The work points to the potential of AI-assisted systematic reviews under controlled prompt design, but performance varies by language, literature type, and prompt reliability, so substantial caution and human oversight remain necessary.
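To see why raw accuracy can mislead at low prevalence, the metrics named above can be computed from a 2x2 confusion matrix. The following is a minimal sketch, not the authors' analysis code: the function name `screening_metrics` and the counts are hypothetical, chosen only so that about 3% of 300 records are relevant (assuming the prevalence figure refers to the share of relevant records).

```python
def screening_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, accuracy, Cohen's kappa and PABAK
    for a 2x2 table of model decisions against human decisions."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # S_e: share of relevant records the model kept
    specificity = tn / (tn + fp)   # S_p: share of irrelevant records the model dropped
    accuracy = (tp + tn) / n       # A: observed agreement

    # Agreement expected by chance, from the marginal include/exclude rates.
    p_include = ((tp + fp) / n) * ((tp + fn) / n)
    p_exclude = ((tn + fn) / n) * ((tn + fp) / n)
    p_chance = p_include + p_exclude

    kappa = (accuracy - p_chance) / (1 - p_chance)
    pabak = 2 * accuracy - 1       # prevalence- and bias-adjusted kappa
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "kappa": kappa,
        "pabak": pabak,
    }


# Hypothetical counts (NOT results from the study): 9 relevant records in 300.
m = screening_metrics(tp=4, fp=6, fn=5, tn=285)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is about 0.96 while $\kappa$ is about 0.40 and sensitivity is below 0.5: most of the agreement comes from the many easy exclusions, which is the kind of skew the chance-corrected measures are meant to expose.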

Abstract

Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could offer a way to speed up and automate systematic reviews, but their performance in such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the biggest LLM so far. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human performance in most tasks, results were skewed by chance agreement and dataset imbalance. After adjusting for these, there was a moderate level of performance for data extraction, and - barring studies that used highly reliable prompts - screening performance levelled at none to moderate for different stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect.' With highly reliable prompts, penalising GPT-4 for missing key studies improved its performance even further. Our findings indicate that, currently, substantial caution is warranted when LLMs are used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.
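One way to "penalise missing key studies" is a weighted $\kappa$ in which a missed relevant study costs more than a wrongly included one. The sketch below illustrates that idea only; the weighting scheme used in the study is not given in this text, so the weights (a miss counts double), the counts, and the function name `weighted_kappa` are all hypothetical.

```python
def weighted_kappa(table, weights):
    """Cohen's weighted kappa with disagreement weights.

    table[i][j]   = number of records the human put in class i and the model in class j
    weights[i][j] = cost of that disagreement (0 on the diagonal)
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # Observed weighted disagreement vs. disagreement expected by chance.
    observed = sum(weights[i][j] * table[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row_share[i] * col_share[j]
                   for i in range(k) for j in range(k))
    return 1 - observed / expected


# Rows: human include / exclude. Columns: model include / exclude.
# Hypothetical counts (NOT results from the study): 1 miss, 6 false inclusions.
table = [[8, 1],
         [6, 285]]

equal_cost = [[0, 1], [1, 0]]    # reduces to ordinary Cohen's kappa
miss_costs_2 = [[0, 2], [1, 0]]  # assumed weighting: a missed study counts double

unweighted = weighted_kappa(table, equal_cost)
weighted = weighted_kappa(table, miss_costs_2)
print(f"kappa: {unweighted:.3f}, weighted kappa: {weighted:.3f}")
```

In this made-up table the model rarely misses a relevant study, so weighting misses more heavily raises agreement rather than lowering it, which matches the direction of the effect described above. A model that missed many relevant studies would see the opposite.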
