Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases

Mercy Asiedu; Nenad Tomasev; Chintan Ghate; Tiya Tiyasirichokchai; Awa Dieng; Oluwatosin Akande; Geoffrey Siwo; Steve Adudans; Sylvanus Aitkins; Odianosen Ehiakhamen; Eric Ndombi; Katherine Heller

TL;DR

This work benchmarks large language models on tropical and infectious disease classification using an expanded TRINDs dataset with demographic, semantic, and consumer augmentations, totaling 11,719 prompts. It demonstrates that contextual information such as demographics, location, and risk factors significantly improves LLM accuracy, while counterfactuals can degrade performance unless robust context is present. The study compares the generalist Gemini Ultra with the health-specialist MedLM Medium, showing that larger models generally perform better and that many-shot in-context learning boosts robustness, particularly for expanded prompts. A human expert baseline and the TRINDs-LM playground are developed to enable scalable, context-aware evaluation and exploration of how context shapes model outputs in health settings, with implications for policy and global health decision support.

Abstract

While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific exploration. We build on an open-source tropical and infectious diseases (TRINDs) dataset, expanding it to include demographic and semantic clinical and consumer augmentations, yielding 11,000+ prompts. We evaluate LLM performance on these, comparing generalist and medical LLMs, as well as LLM outcomes to those of human experts. We demonstrate, through systematic experimentation, the benefit of contextual information such as demographics, location, gender, and risk factors for optimal LLM response. Finally, we develop a prototype of TRINDs-LM, a research tool that provides a playground to navigate how context impacts LLM outputs for health.

Paper Structure (25 sections, 17 figures, 2 tables)

Introduction
Methods
Dataset generation and expansion
Model evaluation
Auto-rater LLM Evaluations
Human Expert Baseline
TRINDs-LM Tool design and development
Statistical analysis
Results
LLM experimental results
Generalist (Gemini Ultra) and specialist (MedLM Medium) model performance on persona variations
Assessing Gemini model performance on varied combinations of attributes and factors
Assessing Gemini model performance on counterfactual inputs
Assessing model performance on the demographic and semantic expansions
Impact of many-shot in-context learning with original persona set on model robustness and generalizability
...and 10 more sections

Figures (17)

Figure 1: Model performance on persona variations. a) Generalist (Gemini) and specialist (MedLM) model performance on clinical, consumer, and French persona variations. b) Gemini model performance on counterfactual location inputs. c) Gemini performance for contextual combinations of attributes and factors and count. d) Gemini performance for race counterfactuals. e) Gemini performance for gender counterfactuals. Legend: S = symptoms (general and specific), gS = general symptoms, sS = specific symptoms, L = location, A = attribute (age and gender). Error bars are 90% confidence intervals. * = p<0.025.
Figure 2: Per-disease performance for LLMs and human experts. a) LLM performance on original persona with different contextual combinations (5 repeated runs). b) LLM performance on location counterfactual with different contextual combinations (5 repeated runs). c) Human expert performance (top 5 out of 7 experts). Error bars are 90% confidence intervals. Legend: S = symptoms (general and specific), gS = general symptoms, sS = specific symptoms, L = location, A = attribute (age and gender), R = risk factor, FP = full persona, Exp_Tot = total expert score, Exp_Maj = expert majority score, Exp_Any = expert any/at least one score, Exp_All = expert all score.
Figure 3: Model performance on expanded dataset. a) LLM performance on demographic clinical and consumer augmentations (2635 each). b) LLM performance on semantic clinical and consumer augmentations (2651 each). We compared the base model with the multi-shot tuned model. **** = p<0.00005.
Figure 4: Expert baseline and data quality rating. a) Expert baseline compared to LLM. b) Expert rating of data quality. c) Expert rating of helpfulness of contextual information. Error bars are 90% CI. * = p<0.05, **** = p<0.00005.
Figure 5: TRINDs research tool showing user entry
...and 12 more figures
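
The Figure 2 legend names four expert aggregates (Exp_Tot, Exp_Maj, Exp_Any, Exp_All) without spelling out how each is computed. The sketch below shows one plausible reading, not the authors' implementation: it assumes a per-question grid of correct/incorrect expert answers, takes Exp_Tot as the fraction of all individual answers that are correct, and takes the other three as the fraction of questions where a strict majority, at least one, or every expert is correct. The function name `expert_scores` and the toy grid are hypothetical.

```python
from typing import Dict, List


def expert_scores(correct: List[List[bool]]) -> Dict[str, float]:
    """Aggregate expert correctness under the assumptions stated above.

    correct[q][e] is True when expert e answered question q correctly.
    Every row is assumed to be non-empty.
    """
    n_questions = len(correct)
    if n_questions == 0:
        raise ValueError("need at least one question")
    n_answers = sum(len(row) for row in correct)
    total_correct = sum(sum(row) for row in correct)
    return {
        # Assumed: share of all individual expert answers that are correct.
        "Exp_Tot": total_correct / n_answers,
        # Assumed: share of questions where a strict majority is correct.
        "Exp_Maj": sum(sum(row) > len(row) / 2 for row in correct) / n_questions,
        # Share of questions where at least one expert is correct.
        "Exp_Any": sum(any(row) for row in correct) / n_questions,
        # Share of questions where every expert is correct.
        "Exp_All": sum(all(row) for row in correct) / n_questions,
    }


# Made-up grid for illustration only: 4 questions, 3 experts.
grid = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [True, True, False],
]
scores = expert_scores(grid)
print(scores)
```

Under this reading, Exp_All can never exceed Exp_Maj, and Exp_Maj can never exceed Exp_Any, which gives a quick sanity check when comparing the expert bars against a single LLM accuracy.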
