Large Language Model Benchmarks in Medical Tasks

Lawrence K. Q. Yan; Qian Niu; Ming Li; Yichao Zhang; Caitlyn Heqi Yin; Cheng Fei; Benji Peng; Ziqian Bi; Pohsun Feng; Keyu Chen; Tianyang Wang; Yunze Wang; Silin Chen; Ming Liu; Junyu Liu; Xinyuan Song; Riyang Bao; Zekun Jiang; Ziyuan Qin

TL;DR

This survey analyzes benchmark datasets for medical large language models across text, image, video, audio, ECG, and omics modalities. It classifies benchmarks into discriminative and generative tasks, highlighting datasets such as MIMIC-III/MIMIC-IV, BioASQ, PubMedQA, MedQA, MedMCQA, and CheXpert, and discusses their roles in clinical NLP applications like report generation, clinical summarization, and diagnostic reasoning. The authors identify challenges including language diversity, limited large-scale medical image-caption data, and integration of omics data, proposing synthetic data and multimodal augmentation as paths forward. By mapping a broad landscape of datasets and tasks, the paper provides a foundation for standardized evaluation and future development of multimodal medical AI with real-world clinical impact.

Abstract

As large language models (LLMs) are increasingly applied in the medical domain, evaluating their performance on benchmark datasets has become crucial. This paper presents a comprehensive survey of benchmark datasets used in medical LLM tasks. The datasets cover text, image, and multimodal benchmarks and address different aspects of medical knowledge, such as electronic health records (EHRs), doctor-patient dialogues, medical question answering, and medical image captioning. The survey categorizes the datasets by modality and discusses their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advances in tasks like medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks to advance multimodal medical intelligence, emphasizing the need for datasets with greater language diversity, structured omics data, and innovative approaches to data synthesis. This work also provides a foundation for future research on applying LLMs in medicine, contributing to the evolving field of medical artificial intelligence.

Paper Structure

Table of Contents