LLM Assistance for Pediatric Depression
Mariia Ignashina, Paulina Bondaronek, Dan Santel, John Pestian, Julia Ive
TL;DR
This study tackles the challenge of pediatric depression screening in primary care, where PHQ-9 data are often sparse and inconsistently documented. It evaluates zero-shot LLMs (Flan-T5, Llama-3, Phi) for extracting depressive-symptom evidence from free-text pediatric EHR notes (ages 6–24) and demonstrates how these extractions can serve as features for downstream ML-based screening. Flan-T5 achieves high precision (e.g., on sleep problems and self-loathing), Phi provides a balanced precision-recall profile, and Llama-3 offers high recall but generalizes more broadly, highlighting the trade-off between targeted accuracy and coverage. The findings indicate that LLM-driven symptom extraction can complement traditional screening, improve diagnostic consistency, and enable scalable, interpretable AI support for clinicians in pediatric mental health, though further validation and workflow integration are needed.
Abstract
Traditional depression screening methods, such as the PHQ-9, are particularly challenging to apply to children in pediatric primary care because of practical limitations. AI has the potential to help, but the scarcity of annotated datasets in mental health, combined with the computational cost of training, highlights the need for efficient, zero-shot approaches. In this work, we investigate the feasibility of state-of-the-art LLMs for depressive symptom extraction in pediatric settings (ages 6–24). This approach aims to complement traditional screening and minimize diagnostic errors. Our findings show that all LLMs are 60% more efficient than word match, with Flan leading in precision (average F1: 0.65, precision: 0.78) and excelling in the extraction of rarer symptoms such as "sleep problems" (F1: 0.92) and "self-loathing" (F1: 0.8). Phi strikes a balance between precision (0.44) and recall (0.60), performing well in categories like "Feeling depressed" (0.69) and "Weight change" (0.78). Llama 3, with the highest recall (0.90), overgeneralizes symptoms, making it less suitable for this type of analysis. The main challenges faced by LLMs are navigating the complex structure of clinical notes, which contain content from different times in the patient trajectory, and misinterpreting elevated PHQ-9 scores, which leads to overgeneralization. Finally, we demonstrate the utility of symptom annotations provided by Flan as features in an ML algorithm, which differentiates depression cases from controls with a high precision of 0.78, showing a major performance boost compared to a baseline that does not use these features.
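The precision, recall, and F1 figures quoted above follow the standard definitions. As a minimal sketch of how such per-symptom scores could be computed, the example below assumes each symptom is evaluated over sets of note identifiers (gold annotations versus model extractions). The note IDs, toy labels, and function names are hypothetical illustrations and do not come from the study's data or code.

```python
from typing import Dict, Set, Tuple


def precision_recall_f1(gold: Set[str], pred: Set[str]) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 over two sets of note IDs."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def score_by_symptom(
    gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]
) -> Dict[str, Tuple[float, float, float]]:
    """Score every symptom in the gold annotations; a symptom the model
    never predicted is scored against an empty set."""
    return {
        symptom: precision_recall_f1(notes, pred.get(symptom, set()))
        for symptom, notes in gold.items()
    }


# Toy data (hypothetical): notes in which each symptom is annotated / extracted.
gold = {
    "sleep problems": {"n1", "n2", "n3"},
    "self-loathing": {"n2", "n4"},
}
pred = {
    "sleep problems": {"n1", "n2"},                # precise, misses one note
    "self-loathing": {"n2", "n3", "n4", "n5"},     # full recall, overgeneralizes
}

scores = score_by_symptom(gold, pred)
for symptom, (p, r, f1) in scores.items():
    print(f"{symptom}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy setup the first symptom mirrors a high-precision profile and the second a high-recall, overgeneralizing profile, which is the trade-off the abstract describes between Flan and Llama 3.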
