Contextual Evaluation of Large Language Models for Classifying Tropical and Infectious Diseases

Mercy Asiedu, Nenad Tomasev, Chintan Ghate, Tiya Tiyasirichokchai, Awa Dieng, Oluwatosin Akande, Geoffrey Siwo, Steve Adudans, Sylvanus Aitkins, Odianosen Ehiakhamen, Eric Ndombi, Katherine Heller

TL;DR

This work benchmarks large language models on tropical and infectious disease classification using an expanded TRINDs dataset with demographic, semantic, and consumer augmentations, totaling 11,719 prompts. It demonstrates that contextual information such as demographics, location, and risk factors significantly improves LLM accuracy, while counterfactuals can degrade performance unless robust context is present. The study compares a generalist Gemini Ultra with a health-specialist MedLM Medium, showing larger models generally perform better and that many-shot in-context learning boosts robustness, particularly for expanded prompts. A human expert baseline and the TRINDs-LM playground are developed to enable scalable, context-aware evaluation and exploration of how context shapes model outputs in health settings, with implications for policy and global health decision-support.

Abstract

While large language models (LLMs) have shown promise for medical question answering, there is limited work focused on tropical and infectious disease-specific exploration. We build on an open-source tropical and infectious diseases (TRINDs) dataset, expanding it to include demographic and semantic clinical and consumer augmentations, yielding more than 11,000 prompts. We evaluate LLM performance on these prompts, comparing generalist and medical LLMs and comparing LLM outcomes with those of human experts. We demonstrate, through systematic experimentation, the benefit of contextual information such as demographics, location, gender, and risk factors for optimal LLM response. Finally, we develop a prototype of TRINDs-LM, a research tool that provides a playground for exploring how context impacts LLM outputs for health.
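The contextual evaluation described in the abstract can be pictured as a prompt ablation: the same case is posed with different subsets of context (symptoms, location, attributes, risk factors) and the model's answers are compared. The following is a minimal sketch under stated assumptions, not the authors' pipeline. The persona fields, their values, the prompt wording, and the helper names `build_prompt` and `context_variants` are all hypothetical and do not come from the TRINDs dataset.

```python
from itertools import combinations

# Hypothetical persona; field names and values are illustrative only
# and are NOT taken from the TRINDs dataset.
persona = {
    "symptoms": "fever, chills, headache",
    "location": "a rural lakeside area",
    "attributes": "34-year-old woman",
    "risk_factors": "sleeps without a bed net",
}


def build_prompt(persona, keys):
    """Compose a classification prompt from the selected context fields."""
    parts = [f"{k.replace('_', ' ').capitalize()}: {persona[k]}." for k in keys]
    return " ".join(parts) + " What is the most likely disease?"


def context_variants(persona, required=("symptoms",)):
    """Yield (keys, prompt) for every subset of optional fields
    added on top of the required fields."""
    optional = [k for k in persona if k not in required]
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            keys = list(required) + list(extra)
            yield keys, build_prompt(persona, keys)


variants = list(context_variants(persona))
for keys, prompt in variants:
    print(keys, "->", prompt)
```

Each variant would then be sent to the model under test and scored against the known label, so that accuracy can be compared across context combinations. The model call is deliberately left out here because the source does not describe the interface used.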

Paper Structure (25 sections, 17 figures, 2 tables)

This paper contains 25 sections, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Model performance on persona variations. a) Generalist (Gemini) and specialist (MedLM) model performance on clinical, consumer and French persona variations. b) Gemini model performance on counterfactual location inputs. c) Gemini performance for contextual combinations of attributes and factors and count. d) Gemini performance for race counterfactuals. e) Gemini performance for gender counterfactuals. Legend: S = symptoms (general and specific), gS = general symptoms, sS = specific symptoms, L = location, A = attribute (age and gender). Error bars are 90% confidence intervals. * = p < 0.025
  • Figure 2: Per-disease performance for LLMs and human experts. a) LLM performance on original persona with different contextual combinations (5 repeated runs). b) LLM performance on location counterfactual with different contextual combinations (5 repeated runs). c) Human expert performance (top 5 out of 7 experts). Error bars are 90% confidence intervals. Legend: S = symptoms (general and specific), gS = general symptoms, sS = specific symptoms, L = location, A = attribute (age and gender), R = risk factor, FP = full persona, Exp_Tot = total expert score, Exp_Maj = expert majority score, Exp_Any = expert any/at least one score, Exp_All = expert all score.
  • Figure 3: Model performance on expanded dataset. a) LLM performance on demographic clinical and consumer augmentations (2635 each). b) LLM performance on semantic clinical and consumer augmentations (2651 each). We compared the base model with the multi-shot tuned model. **** = p < 0.00005
  • Figure 4: Expert baseline and data quality rating. a) Expert baseline compared to LLM. b) Expert rating of data quality. c) Expert rating of helpfulness of contextual information. Error bars are 90% confidence intervals. * = p < 0.05, **** = p < 0.00005
  • Figure 5: TRINDs research tool showing user entry
  • ...and 12 more figures