Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Joshua Harris; Timothy Laurence; Leo Loman; Fan Grayson; Toby Nonnenmacher; Harry Long; Loes WalsGriffith; Amy Douglas; Holly Fountain; Stelios Georgiou; Jo Hardstaff; Kathryn Hopkins; Y-Ling Chi; Galena Kuyumdzhieva; Lesley Larkin; Samuel Collins; Hamish Mohammed; Thomas Finnie; Luke Hounsome; Michael Borowitz; Steven Riley

TL;DR

Problem: determining how well open-weight LLMs perform on public health free-text classification and extraction tasks. Approach: automated domain-specific evaluation across 16 tasks covering burden, risk factors, and interventions, using 11 open-weight LLMs with zero-shot prompting, plus GPT-4 comparisons; performance is measured primarily by micro-F1 and macro-F1, with additional INT-4 AWQ quantization experiments. Findings: Llama-3.3-70B-Instruct most often leads among open-weight models (best on 8/16 tasks); many tasks exceed 0.80 micro-F1, while some (e.g., BioDex Drugs Extraction, Contact Classification) remain challenging; few-shot prompting yields major gains on hard tasks; GPT-4 series models match or exceed open-weight models on a subset of tasks, with comparable overall performance in several cases. Significance: the results show promising potential for open-weight LLMs to support public health surveillance, research, and interventions, while highlighting task-dependent limitations and the need for careful validation, domain-specific annotation, and advanced prompting pipelines for robust real-world deployment.

Abstract

Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to: health burden, epidemiological risk factors, and public health interventions. We evaluate eleven open-weight LLMs (7-123 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3.3-70B-Instruct is the highest performing model, achieving the best results on 8/16 tasks (using micro-F1 scores). We see significant variation across tasks with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve greater than 80% micro-F1 on others, such as GI Illness Classification. For a subset of 11 tasks, we also evaluate three GPT-4 and GPT-4o series models and find comparable results to Llama-3.3-70B-Instruct. Overall, based on these initial results we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources, and support public health surveillance, research, and interventions.
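Since results are reported as micro-F1 and macro-F1, the difference between the two can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `f1_scores` and the toy labels are hypothetical, and the sketch assumes single-label classification (one gold label and one predicted label per text).

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification.

    Hypothetical helper for illustration only.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-label true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label was wrong
            fn[gold] += 1  # gold label was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro-F1 pools counts over all labels before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 computes F1 per label, then takes the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data (assumed labels, not taken from any dataset in the paper).
y_true = ["gi", "gi", "gi", "other"]
y_pred = ["gi", "gi", "other", "other"]
micro, macro = f1_scores(y_true, y_pred)
```

Under the single-label assumption, micro-F1 coincides with accuracy, whereas macro-F1 gives each class equal weight and is therefore more sensitive to performance on rare classes.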

Paper Structure (33 sections, 8 figures, 7 tables)

This paper contains 33 sections, 8 figures, 7 tables.

Introduction
Methods
Public Health Evaluation Tasks and Datasets
Burden
Risk Factors
Interventions
Summary
Evaluation Methodology
Large Language Models
Prompting
Sampling
Dataset Splits
Evaluation Metrics
Results
Model Results
...and 18 more sections

Figures (8)

Figure 1: Public Health Large Language Model (LLM) Evaluation Areas [Number of Evaluations] and Task Evaluation Micro-F1 Scores by Model. (Left) We divide public health free text processing into three sub-domains: (1) burden, such as reports of disease symptoms, cases, morbidity, or mortality; (2) risk factors, such as environmental, behavioural, or biological contributors; (3) interventions, pharmaceutical and non-pharmaceutical. (Right) Evaluation results (micro-F1 scores) for seven open-weight LLM architectures (highest performing model evaluated from each) across the 16 tasks using zero-shot prompting.
Figure 2: Evaluation Tasks by Public Health Area. A summary of the tasks we use to evaluate the LLMs, grouped by public health area. See the Burden, Risk Factors, and Interventions sections for full descriptions.
Figure 3: Comparison of zero-shot and few-shot prompting on challenging tasks (Contact Classification). We compare the baseline zero-shot prompt to a 10-shot prompt for Contact Classification.
Figure 4: Comparison of zero-shot and few-shot prompting on challenging tasks (Health Causal Claims Classification). We compare the baseline zero-shot prompt to a 7-shot prompt for Health Causal Claims Classification.
Figure 5: LLM Evaluation Spectrum. In this paper we focus on a combination of domain-specific and task-specific LLM evaluations within public health, in order to inform our understanding of where LLMs may be successfully deployed within the field. The area marked with (*) denotes where our evaluations sit relative to others.
...and 3 more figures
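
The zero-shot versus few-shot comparison described in the captions for Figures 3 and 4 can be sketched as follows. The prompt template, the function name `build_prompt`, and the example texts and labels are all assumptions made for illustration; the prompts actually used in the paper are not reproduced here.

```python
def build_prompt(instruction, text, examples=()):
    """Build a classification prompt (hypothetical template).

    With no examples this is a zero-shot prompt; with k (text, label)
    pairs it becomes a k-shot prompt.
    """
    parts = [instruction.strip()]
    for i, (ex_text, ex_label) in enumerate(examples, start=1):
        # Each labelled example is shown before the text to classify.
        parts.append(f"Example {i}\nText: {ex_text}\nLabel: {ex_label}")
    # The final block leaves the label open for the model to complete.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


# Illustrative inputs only; not drawn from the paper's datasets.
instruction = "Classify whether the text describes contact with a case."
zero_shot = build_prompt(instruction, "Shared a flat with the patient.")
few_shot = build_prompt(
    instruction,
    "Shared a flat with the patient.",
    examples=[
        ("Sat next to the case on a flight.", "contact"),
        ("No known exposure reported.", "no contact"),
    ],
)
```

Keeping the instruction and the final query identical across both prompts means that any change in model output can be attributed to the added examples.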
