EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports

Lama Moukheiber, Mira Moukheiber, Dana Moukheiber, Jae-Woo Ju, Hyung-Chul Lee

TL;DR

This work addresses the lack of real-world, echocardiography-focused QA data for training and evaluating LLMs in cardiology. It introduces EchoQA, a large-scale QA dataset derived from MIMIC-IV echocardiogram reports, and shows that instruction fine-tuning of diverse LLMs yields better QA performance than zero-shot and few-shot baselines. Clinician evaluation confirms high correctness for the top model, Echo-Mistral, while fairness audits reveal mixed bias across social determinants of health, informing risk-aware deployment. By releasing Echo-Mistral and establishing a comprehensive, multi-model benchmark, the paper provides a practical resource to advance AI-assisted cardiac differential diagnosis and reduce clinician documentation burden.

Abstract

We introduce a novel question-answering (QA) dataset using echocardiogram reports sourced from the Medical Information Mart for Intensive Care database. This dataset is specifically designed to enhance QA systems in cardiology, consisting of 771,244 QA pairs addressing a wide array of cardiac abnormalities and their severity. We compare large language models (LLMs), including open-source and biomedical-specific models for zero-shot evaluation, and closed-source models for zero-shot and three-shot evaluation. Our results show that fine-tuning LLMs improves performance across various QA metrics, validating the value of our dataset. Clinicians also qualitatively evaluate the best-performing model to assess the LLM responses for correctness. Further, we conduct fine-grained fairness audits to assess the bias-performance trade-off of LLMs across various social determinants of health. Our objective is to propel the field forward by establishing a benchmark for LLM AI agents aimed at supporting clinicians with cardiac differential diagnoses, thereby reducing the documentation burden that contributes to clinician burnout and enabling healthcare professionals to focus more on patient care.

Paper Structure

This paper contains 11 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Workflow of the methodology.
  • Figure 2: Categorization of cardiac abnormalities. X represents a specific cardiac abnormality. a) The schema includes the following cardiac abnormalities: right atrial pressure; tricuspid valve regurgitation, tricuspid valve stenosis, and pulmonary hypertension; right ventricular systolic function, right ventricular cavity, and right ventricular wall; left atrial cavity; mitral valve regurgitation and mitral valve stenosis; left ventricular systolic function, left ventricular cavity, left ventricular wall, left ventricular diastolic function, left ventricular outflow tract obstruction, and left regional wall motion abnormality; and aortic valve regurgitation and aortic valve stenosis. b) The schema includes other right ventricular and atrial abnormalities: right ventricular pressure overload and right ventricular volume overload; and right atrial enlargement.
  • Figure 3: Disparities in performance, shown as F1 with standard error over 3 runs, between groups (high, upper middle, lower middle, low) along the social determinants of health for each examined open-source biomedical LLM.
  • Figure 4: Disparities in performance, shown as F1 with standard error over 3 runs, between groups (high, upper middle, lower middle, low) along the social determinants of health for each examined open-source general LLM.
  • Figure 5: Disparities in performance, shown as F1 with standard error over 3 runs, between groups (high, upper middle, lower middle, low) along the social determinants of health for each examined closed-source general LLM.
  • ...and 2 more figures
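
Figures 3 to 5 report F1 with a standard error over 3 runs for each group. As a rough illustration of that summary statistic, the sketch below computes a per-group mean and standard error from per-run scores. It assumes the standard error is the sample standard deviation divided by the square root of the number of runs, which the listing above does not confirm. The function name `mean_and_standard_error` and all F1 values are hypothetical placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev


def mean_and_standard_error(scores):
    """Return (mean, standard error) for a list of per-run scores.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(scores), stdev(scores) / math.sqrt(n)


# Hypothetical F1 scores from 3 runs per group (placeholders, not paper results).
f1_by_group = {
    "high": [0.50, 0.52, 0.54],
    "upper middle": [0.48, 0.50, 0.52],
    "lower middle": [0.47, 0.49, 0.51],
    "low": [0.45, 0.47, 0.49],
}

summary = {
    group: mean_and_standard_error(scores)
    for group, scores in f1_by_group.items()
}

for group, (avg, se) in summary.items():
    print(f"{group}: F1 = {avg:.3f} +/- {se:.3f}")
```

With only 3 runs the standard error is a coarse estimate, so small gaps between groups should be read with caution.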