Can Large Language Models Function as Qualified Pediatricians? A Systematic Evaluation in Real-World Clinical Contexts

Siyu Zhu, Mouxiao Bian, Yue Xie, Yongyu Tang, Zhikang Yu, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Xiaoyan Dong

TL;DR

This study introduces PEDIASBench, a competency-based framework for evaluating large language models in pediatric care across three areas: foundational knowledge, dynamic diagnostic and therapeutic reasoning, and medical ethics and safety. Testing 12 LLMs on 211 diseases spanning 19 subspecialties, the authors find strong recall of basic knowledge but limited dynamic reasoning and humanistic care, with performance often declining as task complexity increases. Models excel at foundational questions but struggle with longitudinal, context-dependent decision-making; ethics and safety performance varies by task, with no single model performing best across all of them. The work argues for multimodal integration, retrieval-augmented reasoning, and clinician-in-the-loop validation to enable safe, effective AI-assisted pediatric care and education, framing a path toward trustworthy human–AI collaboration in child health.

Abstract

With the rapid rise of large language models (LLMs) in medicine, a key question is whether they can function as competent pediatricians in real-world clinical settings. We developed PEDIASBench, a systematic evaluation framework built around a knowledge-system framework and tailored to realistic clinical environments. PEDIASBench assesses LLMs across three dimensions: application of basic knowledge, dynamic diagnosis and treatment capability, and pediatric medical safety and medical ethics. We evaluated 12 representative models released over the past two years, including GPT-4o, Qwen3-235B-A22B, and DeepSeek-V3, covering 19 pediatric subspecialties and 211 prototypical diseases. State-of-the-art models performed well on foundational knowledge, with Qwen3-235B-A22B achieving over 90% accuracy on licensing-level questions, but performance declined by about 15% as task complexity increased, revealing limitations in complex reasoning. Multiple-choice assessments highlighted weaknesses in integrative reasoning and knowledge recall. In dynamic diagnosis and treatment scenarios, DeepSeek-R1 scored highest in case reasoning (mean 0.58), yet most models struggled to adapt to real-time patient changes. On pediatric medical ethics and safety tasks, Qwen2.5-72B performed best (accuracy 92.05%), though humanistic sensitivity remained limited. These findings indicate that LLMs in pediatrics are constrained by limited dynamic decision-making and underdeveloped humanistic care. Future development should focus on multimodal integration and an iterative loop between clinical feedback and model updates to enhance safety, interpretability, and human-AI collaboration. While current LLMs cannot independently perform pediatric care, they hold promise for decision support, medical education, and patient communication, laying the groundwork for a safe, trustworthy, and collaborative intelligent pediatric healthcare system.

Paper Structure

This paper contains 32 sections, 6 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Pediatric Evaluation of Dynamic Intelligence, Adaptability, and Safety Benchmark
  • Figure 2: Accuracy comparison of large language models across four physician levels in single-choice tasks.
  • Figure 3: Performance of large language models across four physician levels in multiple-choice tasks.
  • Figure 4: Performance of large language models in dynamic diagnosis and treatment capability.
  • Figure 5: Comparative performance of large language models in pediatric medical safety and medical ethics.