PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

Qian Zhang, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Heng Zhao, Mei Chen, Hui Li, Yanhao Wang

TL;DR

PediaBench introduces the first Chinese pediatric QA benchmark tailored for LLM evaluation, combining 4,117 objective and 1,632 subjective questions across 12 disease groups with an integrated scoring framework that blends difficulty-based objective scoring and LLM-based subjective scoring. The dataset draws on the CNMLE, final medical exams, standards and guidelines, and clinical practice, and its five question types test both knowledge understanding and generation capability. Extensive experiments across 20 LLMs reveal a substantial gap between current models and expert clinical performance: models are comparatively strong on objective questions but notably weaker in case analysis and generation quality, underscoring the need for medical knowledge injection and retrieval-augmented approaches in pediatrics. The work provides a publicly available resource for ongoing pediatric LLM benchmarking and model improvement in the Chinese medical domain.

Abstract

The emergence of Large Language Models (LLMs) in the medical domain has created a pressing need for standard datasets to evaluate their question-answering (QA) performance. Although several benchmark datasets for medical QA exist, they either cover common knowledge across different departments or are specific to a department other than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,117 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of the experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context and highlight their limitations to guide further improvement. Our code and data are published at https://github.com/ACMISLab/PediaBench.

Paper Structure

This paper contains 34 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Illustration of the overall framework of PediaBench.
  • Figure 2: Examples for different types of questions and their answers in PediaBench.
  • Figure 3: Statistics on the number of ToF and MC questions at different difficulty levels.
  • Figure 4: Illustrations of evaluation results for open-source LLMs.
  • Figure 5: Prompt for disease group classification and an exemplar response of GLM-4.
  • ...and 7 more figures