Can LLMs Help Allocate Public Health Resources? A Case Study on Childhood Lead Testing

Mohamed Afane, Ying Wang, Juntao Chen

TL;DR

The study addresses how to allocate limited public health resources for childhood lead testing by constructing a Priority Score that combines lead prevalence, testing gaps, and health coverage patterns across 136 neighborhoods in Chicago, NYC, and DC. It then tests whether state-of-the-art LLMs with agentic reasoning and deep research modes can autonomously allocate 1,000 test kits per city, and finds substantial limitations: accuracy averages about 0.46 and peaks at 0.66. Common failures include neglecting the highest-risk neighborhoods and overemphasizing less vulnerable areas. A key finding is the positive cross-city association between public health coverage and lead vulnerability (average correlation 0.50), which supports the Priority Score as a practical, data-driven framework for targeted interventions that still requires human validation. Overall, the results suggest that while LLMs hold promise for assisting public health decision-making, current capabilities are insufficient for autonomous, policy-level resource allocation without rigorous data integration and oversight.

Abstract

Public health agencies face critical challenges in identifying high-risk neighborhoods for childhood lead exposure with limited resources for outreach and intervention programs. To address this, we develop a Priority Score integrating the proportion of untested children, elevated blood lead prevalence, and public health coverage patterns to support optimized resource allocation decisions across 136 neighborhoods in Chicago, New York City, and Washington, D.C. We leverage these allocation tasks, which require integrating multiple vulnerability indicators and interpreting empirical evidence, to evaluate whether large language models (LLMs) with agentic reasoning and deep research capabilities can effectively allocate public health resources when presented with structured allocation scenarios. LLMs were tasked with distributing 1,000 test kits within each city based on neighborhood vulnerability indicators. Results reveal significant limitations: LLMs frequently overlooked neighborhoods with the highest lead prevalence and the largest proportions of untested children, such as West Englewood in Chicago, while allocating disproportionate resources to lower-priority areas like Hunts Point in New York City. Overall accuracy averaged 0.46, reaching a maximum of 0.66 with ChatGPT 5 Deep Research. Despite their marketed deep research capabilities, LLMs struggled with fundamental limitations in information retrieval and evidence-based reasoning, frequently citing outdated data and allowing non-empirical narratives about neighborhood conditions to override quantitative vulnerability indicators.
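To make the allocation task concrete, the following is a minimal sketch of how a composite score and a 1,000-kit split could be computed. It is not the paper's formula: the equal weights, the min-max normalization, the proportional split with largest-remainder rounding, and all names (`Neighborhood`, `priority_scores`, `allocate_kits`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Neighborhood:
    name: str
    untested_share: float    # fraction of children not yet tested, 0..1
    ebll_prevalence: float   # fraction of tested children with elevated blood lead, 0..1
    public_coverage: float   # fraction of residents with public health coverage, 0..1


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_scores(hoods, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three normalized indicators (weights are assumed)."""
    columns = [
        min_max([h.untested_share for h in hoods]),
        min_max([h.ebll_prevalence for h in hoods]),
        min_max([h.public_coverage for h in hoods]),
    ]
    return [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(hoods))
    ]


def allocate_kits(scores, total_kits=1000):
    """Split kits in proportion to score, using largest-remainder rounding."""
    total = sum(scores)
    if total == 0:
        quotas = [total_kits / len(scores)] * len(scores)
    else:
        quotas = [total_kits * s / total for s in scores]
    kits = [int(q) for q in quotas]  # floor of each non-negative quota
    leftover = total_kits - sum(kits)
    # Hand the remaining kits to the largest fractional parts.
    by_remainder = sorted(
        range(len(scores)), key=lambda i: quotas[i] - kits[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        kits[i] += 1
    return kits
```

With made-up inputs for three placeholder neighborhoods, `allocate_kits(priority_scores(hoods))` returns integer counts that sum to exactly 1,000, with the largest share going to the neighborhood that ranks highest on all three indicators. A real analysis would need the paper's actual weighting and the agencies' surveillance data.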

Paper Structure

This paper contains 17 sections, 1 equation, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of LLM resource allocation recommendations for lead testing kits against empirical vulnerability rankings. The left panel shows the five highest-priority neighborhoods based on historical data and vulnerability metrics. The right panels display top-three allocations from ChatGPT, Claude, and Gemini for Chicago. Green highlights indicate correct identification of high-priority areas, while red highlights with crosses mark lower-priority neighborhoods incorrectly placed in the top three.
  • Figure 2: Correlation of Vulnerability with Health Coverage Types across Chicago, New York City, and Washington, D.C. Public health coverage is generally associated with higher vulnerability indices (average correlation: 0.50), while private health coverage shows an inverse correlation (average correlation: -0.54), as indicated by the dotted lines.
  • Figure 3: Maps of Chicago, New York City, and Washington, D.C., showing neighborhoods with the highest Priority Score values for targeted interventions. These highlighted areas represent regions with the greatest need for enhanced lead testing and remediation efforts, particularly in neighborhoods with a high concentration of public health coverage.
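
The per-city correlations and cross-city averages reported in the Figure 2 caption could be computed along the following lines. This is a sketch only: it assumes a Pearson correlation and a simple unweighted mean across cities, neither of which is confirmed by the captions above, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_correlation(per_city):
    """Unweighted mean of per-city correlations.

    per_city maps a city name to a (coverage_shares, vulnerability_index)
    pair, one value per neighborhood in each sequence.
    """
    values = [pearson(cov, vuln) for cov, vuln in per_city.values()]
    return sum(values) / len(values)
```

Calling `average_correlation` once with public coverage shares and once with private coverage shares would yield the two averages that Figure 2 marks with dotted lines, given the underlying neighborhood data.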