
Forecasting Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support

Md Tanvir Hasan Turja

TL;DR

This paper presents a two-component framework for AMR trend forecasting and evidence-grounded policy decision support. It also implements a Retrieval-Augmented Generation pipeline that combines a ChromaDB vector store of WHO policy documents with a locally deployed Phi-3 Mini language model to produce source-attributed, hallucination-constrained policy answers.

Abstract

Antimicrobial resistance (AMR) is a growing global crisis projected to cause 10 million deaths per year by 2050. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized surveillance data across 44 countries, few studies have applied machine learning to forecast population-level resistance trends from these data. This paper presents a two-component framework for AMR trend forecasting and evidence-grounded policy decision support. We benchmark six models -- Naive, Linear Regression, Ridge Regression, XGBoost, LightGBM, and LSTM -- on 5,909 WHO GLASS observations across six WHO regions (2021-2023). XGBoost achieved the best performance, with a test MAE of 7.07% and an R-squared of 0.854, outperforming the naive baseline by 83.1%. Feature importance analysis identified the prior-year resistance rate as the dominant predictor (50.5% importance), while regional MAE ranged from 4.16% (European Region) to 10.14% (South-East Asia Region). We additionally implemented a Retrieval-Augmented Generation (RAG) pipeline that combines a ChromaDB vector store of WHO policy documents with a locally deployed Phi-3 Mini language model, producing source-attributed, hallucination-constrained policy answers. Code and data are available at https://github.com/TanvirTurja
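To make the retrieval step of the RAG component concrete, here is a minimal sketch of retrieve-then-prompt with source attribution. It is not the paper's implementation: a bag-of-words cosine similarity stands in for the ChromaDB vector store, no language model is called, and the document snippets, file names, and the helper names `retrieve` and `build_prompt` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical policy snippets. The text and source names are placeholders,
# not quotations from WHO documents.
DOCS = [
    {"source": "doc_a.pdf",
     "text": "National action plans should strengthen surveillance of antimicrobial resistance."},
    {"source": "doc_b.pdf",
     "text": "Stewardship programmes reduce inappropriate antibiotic prescribing in hospitals."},
    {"source": "doc_c.pdf",
     "text": "Infection prevention and hand hygiene limit the spread of resistant bacteria."},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, docs, k=2):
    """Return up to k documents with non-zero similarity to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(d["text"]))), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]


def build_prompt(question, passages):
    """Assemble a prompt that restricts the answer to the retrieved context."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context below and cite the source in brackets. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    question = "How can hospitals reduce inappropriate antibiotic prescribing?"
    print(build_prompt(question, retrieve(question, DOCS)))
```

Tagging each passage with its source in the prompt is one simple way to obtain source-attributed answers, and the instruction to refuse when context is insufficient is one common way to constrain hallucination. A real pipeline would replace `retrieve` with an embedding-based vector search and pass the prompt to the language model.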

Paper Structure (24 sections, 4 figures, 4 tables)


Figures (4)

  • Figure 1: Model performance comparison across all six models on validation and test sets. XGBoost achieves the lowest test MAE (7.07%), with LightGBM and LSTM close behind.
  • Figure 2: XGBoost feature importance (gain-based). Resistance_lag1 dominates at 50.5%, confirming strong temporal autocorrelation in resistance rates.
  • Figure 3: XGBoost test MAE disaggregated by WHO region. The European Region achieves the lowest error (4.16%), while South-East Asia shows the highest (10.14%), reflecting disparities in GLASS data coverage.
  • Figure 4: XGBoost residual and error analysis on the 2023 test set. Most predictions fall within a narrow error range, with larger errors concentrated in high-resistance observations.
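The lag feature and the regional error breakdown described in these captions can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the naive model is assumed to be a persistence forecast (predict last year's rate), the rows are made-up values rather than GLASS data, and the function names are hypothetical. Only the feature name `resistance_lag1` and the 2023 test year come from the captions.

```python
from collections import defaultdict

# Illustrative records only:
# (region, country, pathogen_drug, year, resistance_pct).
# The values are invented for this sketch and are not GLASS data.
ROWS = [
    ("EUR", "A", "E. coli / ciprofloxacin", 2021, 20.0),
    ("EUR", "A", "E. coli / ciprofloxacin", 2022, 22.0),
    ("EUR", "A", "E. coli / ciprofloxacin", 2023, 21.0),
    ("SEAR", "B", "E. coli / ciprofloxacin", 2021, 60.0),
    ("SEAR", "B", "E. coli / ciprofloxacin", 2022, 70.0),
    ("SEAR", "B", "E. coli / ciprofloxacin", 2023, 64.0),
]


def add_lag1(rows):
    """Attach the previous year's resistance rate for the same series.

    Rows with no observation in the preceding year are dropped.
    """
    by_key = {(r, c, p, y): v for r, c, p, y, v in rows}
    out = []
    for r, c, p, y, v in rows:
        prev = by_key.get((r, c, p, y - 1))
        if prev is not None:
            out.append({"region": r, "year": y, "actual": v,
                        "resistance_lag1": prev})
    return out


def mae(pairs):
    """Mean absolute error over (actual, predicted) pairs."""
    pairs = list(pairs)
    return sum(abs(a - p) for a, p in pairs) / len(pairs)


def naive_mae_by_region(rows, test_year):
    """MAE of the persistence forecast on the test year, per region."""
    groups = defaultdict(list)
    for rec in add_lag1(rows):
        if rec["year"] == test_year:
            # Persistence forecast: the prediction is last year's rate.
            groups[rec["region"]].append((rec["actual"], rec["resistance_lag1"]))
    return {region: mae(p) for region, p in groups.items()}


if __name__ == "__main__":
    print(naive_mae_by_region(ROWS, test_year=2023))
```

Splitting by year rather than at random keeps future observations out of training, which matters when the strongest predictor is the previous year's value. The same grouping works for any model: replace the `resistance_lag1` prediction with the model's output to get a per-region MAE like the one in Figure 3.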