AgentSLR: Automating Systematic Literature Reviews in Epidemiology with Agentic AI

Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr Błaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H. Torr, Jakob Foerster, Scott A. Hale, Thomas Rawson, Anne Cori, Elizaveta Semenova, Adam Mahdi

Abstract

Systematic literature reviews are essential for synthesising scientific evidence but are costly, time-intensive and difficult to scale, creating bottlenecks for evidence-based policy. We study whether large language models can automate the complete systematic review workflow, from article retrieval and screening through data extraction to report synthesis. Applied to epidemiological reviews of nine WHO-designated priority pathogens and validated against expert-curated ground truth, our open-source agentic pipeline (AgentSLR) achieves performance comparable to human researchers while reducing review time from approximately 7 weeks to 20 hours (a 58x speed-up). Our comparison of five frontier models reveals that performance on SLR tasks is driven less by model size or inference cost than by each model's distinctive capabilities. Through human-in-the-loop validation, we identify key failure modes. Our results demonstrate that agentic AI can substantially accelerate scientific evidence synthesis in specialised domains.

Paper Structure (90 sections, 14 equations, 13 figures, 27 tables)

Figures (13)

  • Figure 1: End-to-end agentic pipeline (AgentSLR) for automated systematic literature reviews. The pipeline demonstrates complete automation of the systematic review workflow in epidemiology, using open-source modular components. (a) Article Search and Retrieval queries bibliographic databases with domain-specific Boolean searches and obtains PDFs from open-access sources. (b) Title and Abstract Screening applies language reasoning models to filter articles using expert-designed inclusion/exclusion criteria. (c) PDF-to-Markdown Conversion uses an image-to-text OCR model to convert PDFs to machine-readable Markdown. (d) Full-text Screening applies stricter filtering criteria than (b). (e) Data Extraction employs multi-stage tool-calling with schema validation to extract structured epidemiological data (parameters, models, outbreaks). (f) Report Generation synthesises extracted data through programmatic descriptive generation followed by iterative LRM self-refinement (writing, critique and evidence grounding). For more details, see Section \ref{sec:pipeline}.
  • Figure 2: Human vs. AgentSLR SLR completion time. AgentSLR (with GPT-OSS-120B) completes the end-to-end workflow in $20$ hours versus $385$ hours for manually conducted reviews ($19.3\times$ speed-up). Running continuously, this corresponds to less than 1 day ($0.83$) versus $48.1$ human workdays (assuming 8-hour days), yielding $58\times$ calendar-time savings. Of AgentSLR's run-time, data extraction accounts for $13.4$ hours ($67\%$), title and abstract screening for $3.2$ hours ($16\%$), PDF-to-MD conversion for $2.8$ hours ($14\%$), and full-text screening for under $1$ hour. Times shown reflect processing of $9,132$ articles at abstract screening, $1,102$ at full-text screening and $395$ at data extraction. Report generation ($\leq 5$ minutes per pathogen) has been omitted. For more information, see Appendix \ref{app:pipeline_statistics}.
  • Figure 3: Recall of article screening strategies across pathogens. Two ablation screening strategies (human-conditioned, direct full-text) with AgentSLR (GPT-OSS-120B) offer better recall (or 'fetch rate') than traditional AI-based two-stage screening. Bootstrapped confidence intervals (95% C.I.; 10,000 resamples) for the two ablations overlap across most pathogens. Full article screening metrics, along with individual title & abstract stage screening results, are reported in Appendix \ref{app:extended_results_article_screening}.
  • Figure 4: Human expert evaluation of data extraction quality across stages. We report expert-rated flagging precision, field-level extraction accuracy, and perceived AgentSLR (gpt-oss-120b) competence for parameter, model, and outbreak extractions, aggregated across six epidemiologists. Error bars denote standard errors, and dashed lines indicate mean competence ratings ($4.2$ for parameters, $2.8$ for models, and $3.9$ for outbreaks).
  • Figure 5: Model ablation results with AgentSLR across all pipeline stages. Macro F1 is reported for five client models, evaluated separately for each pathogen. Averages are computed over the pathogens evaluated at each stage, following ground-truth availability described in Section \ref{sec:methods_data}. Error bars indicate one standard deviation across pathogens. For the three data extraction panels, coloured dots show the macro F1 of the Flagging $\bullet$, Counts $\bullet$, and Extraction $\bullet$ sub-tasks, plotted to the left of each bar. No single model dominates across all stages: Kimi-K2.5 and gpt-oss-120b lead screening, while extraction leaders vary by data type. Full pathogen-wise metrics are provided in Appendix \ref{app:model_ablations}.
  • ...and 8 more figures
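
The two speed-up figures quoted in the Figure 2 caption (roughly $19.3\times$ and $58\times$) differ only in how time is counted: the first compares raw hours, the second compares 8-hour human workdays with a pipeline running around the clock. The following is a minimal sketch of that arithmetic, using only the numbers stated in the caption. The function and variable names are illustrative and do not come from the AgentSLR codebase.

```python
# Numbers quoted in the Figure 2 caption.
HUMAN_HOURS = 385.0          # manually conducted reviews
AGENT_HOURS = 20.0           # AgentSLR end-to-end run-time (approximate)
WORKDAY_HOURS = 8.0          # caption's assumption for human workdays
CALENDAR_DAY_HOURS = 24.0    # pipeline assumed to run continuously

# Per-stage run-times stated in the caption (hours).
STAGE_HOURS = {
    "data extraction": 13.4,
    "title and abstract screening": 3.2,
    "PDF-to-MD conversion": 2.8,
}


def speedups(human_hours: float, agent_hours: float) -> dict:
    """Return the hour-for-hour and calendar-time speed-ups."""
    human_workdays = human_hours / WORKDAY_HOURS
    agent_days = agent_hours / CALENDAR_DAY_HOURS
    return {
        "hour_speedup": human_hours / agent_hours,
        "human_workdays": human_workdays,
        "agent_days": agent_days,
        "calendar_speedup": human_workdays / agent_days,
    }


def stage_shares(stage_hours: dict, total_hours: float) -> dict:
    """Return each stage's fraction of the total run-time."""
    return {stage: hours / total_hours for stage, hours in stage_hours.items()}


result = speedups(HUMAN_HOURS, AGENT_HOURS)
shares = stage_shares(STAGE_HOURS, AGENT_HOURS)

# Hours not covered by the three listed stages; if the total is taken as
# exactly 20 hours (an assumption, since the caption's total is rounded),
# this is the time left for full-text screening.
remaining_hours = AGENT_HOURS - sum(STAGE_HOURS.values())

if __name__ == "__main__":
    for key, value in result.items():
        print(f"{key}: {value:.2f}")
    for stage, share in shares.items():
        print(f"{stage}: {share:.0%}")
    print(f"remaining hours: {remaining_hours:.1f}")
```

Under these assumptions the hour-for-hour ratio is 385 / 20 = 19.25 and the calendar ratio is 48.125 / 0.833 ≈ 57.75, which match the caption's $19.3\times$ and $58\times$ up to rounding. The hours left over after the three listed stages are consistent with the caption's statement that full-text screening takes under 1 hour.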