Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study

Mohammed Rakibul Hasan

Abstract

Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question--answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.

Paper Structure (35 sections, 1 equation, 8 figures, 3 tables)

Introduction
Related Work
LLMs in Healthcare and Public Health
Challenges of Reliability: Hallucination and Factuality
Evaluations in Resource-Limited Settings
Bangladesh as a Case Context
Methodology
Phase 1: Question Generation
Data Sources
Question Generation by Models
Final Dataset
Phase 2: Response Collection
Prompt Template and Standardisation
Programmatic Calls
Data Storage and Pre-processing
...and 20 more sections

Figures (8)

Figure 1: Overview of the methodology.
Figure 2: Inter-model semantic similarity heatmap. Each cell shows the mean cosine similarity between the row and column model across all 100 questions, computed from all-MiniLM-L6-v2 embeddings (384-dimensional). The diagonal is masked. Colour scale: 0.50 (light) to 1.00 (dark blue). Mistral-7B--Gemini Pro is the highest-similarity pair ($0.765$); Llama-3--Gemini Pro is the lowest ($0.671$).
Figure 3: Divergence category distributions. Left (Fig. 3a): Horizontal bar chart of overall category counts across all 100 questions---82 Needs Review (orange), 18 High Divergence / Hallucination Likely (red), 0 Highly Consistent (green)---with percentage labels. Right (Fig. 3b): Stacked bar chart by disease domain showing that all 18 high-divergence questions belong to the COVID-19 domain, where they make up $72\%$ of that domain's 25 questions, while Dengue, Chikungunya, and Nipah Virus contain only Needs Review questions.
Figure 4: Per-question mean pairwise cosine similarity across all 100 benchmark questions, colour-coded by disease domain (Dengue: red, COVID-19: blue, Chikungunya: green, Nipah Virus: purple). Horizontal dashed lines mark the Highly Consistent ($\geq 0.85$, upper) and Needs Review ($\geq 0.65$, lower) thresholds. Vertical dotted lines delimit domain boundaries at questions 26, 51, and 76. All points below the lower threshold fall within questions 26--50, the COVID-19 domain.
Figure 5: NLI label distribution by model, as a percentage of total sentences per model. Each model contributes 100 sentences. Bars are grouped by model on the x-axis; colours indicate label: Entailment (green), Neutral (orange), Contradiction (red). The legend is positioned outside the plot to the right. Note Gemini Pro's uniquely bimodal pattern: it has both the highest entailment share (22%) and the highest contradiction share (43%) among the models.
...and 3 more figures
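
The captions for Figures 3 and 4 describe a per-question score (the mean pairwise cosine similarity between model answers) and two thresholds, 0.85 and 0.65. The following is a minimal sketch of how such a categorisation could be computed, not the paper's actual implementation. It assumes that scores below 0.65 correspond to the High Divergence / Hallucination Likely category, and it uses short hand-written toy vectors in place of the 384-dimensional all-MiniLM-L6-v2 embeddings. The function names are hypothetical.

```python
import math
from itertools import combinations

# Thresholds taken from the Figure 4 caption.
HIGHLY_CONSISTENT = 0.85
NEEDS_REVIEW = 0.65


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over every unordered pair of models.

    `embeddings` maps a model name to the embedding of its answer
    to a single question.
    """
    pairs = list(combinations(embeddings.values(), 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def categorise(score):
    """Map a per-question score to a divergence category.

    Assumption: anything below the lower threshold is treated as
    High Divergence / Hallucination Likely.
    """
    if score >= HIGHLY_CONSISTENT:
        return "Highly Consistent"
    if score >= NEEDS_REVIEW:
        return "Needs Review"
    return "High Divergence / Hallucination Likely"


# Toy 3-dimensional vectors standing in for real answer embeddings.
toy_answers = {
    "GPT-4": [0.9, 0.1, 0.2],
    "Gemini Pro": [0.8, 0.3, 0.1],
    "Llama 3": [0.2, 0.9, 0.4],
    "Mistral-7B": [0.7, 0.2, 0.3],
}

score = mean_pairwise_similarity(toy_answers)
print(round(score, 3), categorise(score))
```

With four models there are six unordered pairs per question, so each per-question score in this sketch is an average of six cosine similarities.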