GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng

TL;DR

GEM addresses two major gaps in ECG interpretation by unifying time-series data, 12-lead ECG images, and text within a single multimodal framework to produce grounded, clinician-aligned explanations. The approach combines a dual-encoder architecture, cross-modal alignment into a shared textual space, and knowledge-guided instruction data generation to create heartbeat-level, feature-grounded analyses. Empirical results show GEM outperforms prior models on in-domain and out-of-domain benchmarks, with strong improvements in diagnosis accuracy, explainability, and grounding; cardiologist evaluations confirm its clinical usefulness and reliability. The work introduces the ECG-Grounding dataset and the Grounded ECG Understanding task, providing resources and evaluation protocols to advance trustworthy, interpretable conversational AI for cardiac care.

Abstract

While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN $7.4\% \uparrow$), explainability ($22.7\% \uparrow$), and grounding ($24.8\% \uparrow$), making it more suitable for real-world clinical applications. GitHub repository: https://github.com/lanxiang1017/GEM.git

Paper Structure

This paper contains 23 sections, 10 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: GEM offers superior granularity in ECG interpretation compared to state-of-the-art models and human-written reports.
  • Figure 2: GEM's Architecture. Multimodal Encoding: Separate encoders process ECG time series and images to generate modality-specific representations, enabling a holistic analysis of ECG data. Cross-modal Alignment Learning: Time series and image representations are first aligned and then mapped to a textual space using a shared projector, ensuring coherent understanding for the LLM. Knowledge-guided Instruction Data Generation: Physiological features extracted from all 12 leads are sequenced and structured using a diagnosis guider, which prompts GPT-4o with domain-specific instructions to generate high-granularity instructional data.
  • Figure 3: Comparison of ECG-Instruct and our ECG-Grounding.
  • Figure 4: Cardiologist Evaluations. Blue: Findings exceeding expert expectations. Yellow: Findings with differing expert opinions.
  • Figure 5: Cardiologist Evaluations. Blue: Findings exceeding expert expectations. Yellow: Findings with differing expert opinions.
  • ...and 4 more figures
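
The Figure 2 caption describes physiological features from all 12 leads being sequenced and structured before they are used to prompt a language model. The following is a minimal sketch of that sequencing step only, not the authors' implementation: the `BeatFeatures` class, its field names, the `build_prompt` function, the instruction text, and the numeric values are all hypothetical and chosen for illustration. No model is called.

```python
from dataclasses import dataclass

# Standard 12-lead ordering; using a fixed order keeps the sequence deterministic.
LEADS = ["I", "II", "III", "aVR", "aVL", "aVF", "V1", "V2", "V3", "V4", "V5", "V6"]


@dataclass
class BeatFeatures:
    """Hypothetical per-beat measurements (field names are illustrative)."""
    pr_interval_ms: float
    qrs_duration_ms: float


def build_prompt(features: dict[str, list[BeatFeatures]], instruction: str) -> str:
    """Sequence per-lead, per-beat features into a single text prompt."""
    lines = [instruction, ""]
    for lead in LEADS:
        # Leads with no extracted beats are simply skipped.
        for i, beat in enumerate(features.get(lead, []), start=1):
            lines.append(
                f"Lead {lead}, beat {i}: PR={beat.pr_interval_ms:.0f} ms, "
                f"QRS={beat.qrs_duration_ms:.0f} ms"
            )
    return "\n".join(lines)


# Made-up values, for illustration only.
example = {
    "II": [BeatFeatures(160, 90), BeatFeatures(162, 92)],
    "I": [BeatFeatures(158, 88)],
}
prompt = build_prompt(example, "Interpret this ECG, citing the measurements below.")
print(prompt)
```

Keeping the measurements in the prompt as explicit per-beat lines is one way a generated explanation could refer back to specific values such as PR or QRS intervals; the format the paper actually uses may differ.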