ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis

Xu Wang; Jiaju Kang; Puyu Han; Yubao Zhao; Qian Liu; Liwenfei He; Lingqiong Zhang; Lingyun Dai; Yongcheng Wang; Jie Tao

TL;DR

ECG-Expert-QA is a multimodal, open-source benchmark for evaluating medical LLMs in heart-disease diagnosis. It combines real ECG data with synthetic cases across 12 tasks, totaling 47,211 QA pairs, and supports multi-turn dialogue evaluation. A three-pronged data-generation framework (expert-guided knowledge assessment, cross-modal ECG-to-text reasoning, and medical risk assessment) underpins content spanning medical knowledge, diagnostic reasoning, and ethics. Evaluation with BLEU-1, ROUGE-L, METEOR, and a Model-to-Model Scoring metric across GPT-4o, DeepSeek-V3, Qwen2.5, and the lightweight MiniMind2 reveals a trade-off between lexical precision and semantic depth: large models excel in complex reasoning and ethics, while smaller models show lexical strength in knowledge tasks. The benchmark advances trustworthy, conversational ECG AI and opens avenues for real-time, longitudinal, and cross-lingual clinical AI research.
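
To make the lexical metrics named above concrete, the following is a minimal sketch of BLEU-1 and ROUGE-L for a single answer/reference pair. It is not the benchmark's evaluation code: whitespace tokenization, the F1 form of ROUGE-L, and the example strings are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the benchmark may tokenize differently.
    return text.lower().split()


def bleu1(candidate, reference):
    """Clipped unigram precision multiplied by the standard BLEU brevity penalty."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def lcs_length(a, b):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall (beta = 1 is an assumption)."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)


# Hypothetical model answer and reference, not taken from the dataset.
reference = "sinus rhythm with st elevation in leads v1 to v4"
answer = "sinus rhythm with st elevation"
print(round(bleu1(answer, reference), 4), round(rouge_l(answer, reference), 4))
```

In this sketch, a short answer that is correct word for word still scores well below 1 on both metrics because it omits part of the reference, which illustrates why lexical overlap alone can diverge from a judgment of semantic or clinical adequacy.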

Abstract

We present ECG-Expert-QA, a comprehensive multimodal dataset for evaluating diagnostic capabilities in electrocardiogram (ECG) interpretation. It combines real-world clinical ECG data with systematically generated synthetic cases, covering 12 essential diagnostic tasks and totaling 47,211 expert-validated QA pairs. These encompass diverse clinical scenarios, from basic rhythm recognition to complex diagnoses involving rare conditions and temporal changes. A key innovation is the support for multi-turn dialogues, enabling the development of conversational medical AI systems that emulate clinician-patient or interprofessional interactions. This allows for more realistic assessment of AI models' clinical reasoning, diagnostic accuracy, and knowledge integration. Constructed through a knowledge-guided framework with strict quality control, ECG-Expert-QA ensures linguistic and clinical consistency, making it a high-quality resource for advancing AI-assisted ECG interpretation. It challenges models with tasks like identifying subtle ischemic changes and interpreting complex arrhythmias in context-rich scenarios. To promote research transparency and collaboration, the dataset, accompanying code, and prompts are publicly released at https://github.com/Zaozzz/ECG-Expert-QA
