How Well Do Multimodal Models Reason on ECG Signals?

Maxwell A. Xu, Harish Haresumadram, Catherine W. Liu, Patrick Langer, Jathurshan Pradeepkumar, Wanting Mao, Sunita J. Ferns, Aradhana Verma, Jimeng Sun, Paul Schmiedmayer, Xin Liu, Daniel McDuff, Emily B. Fox, James M. Rehg

TL;DR

This work introduces a reproducible framework for evaluating reasoning on ECG signals by decomposing reasoning into two distinct components: Perception, the accurate identification of patterns within the raw signal, and Deduction, the logical application of domain knowledge to those patterns.

Abstract

While multimodal large language models offer a promising solution to the "black box" nature of health AI by generating interpretable reasoning traces, verifying the validity of these traces remains a critical challenge. Existing evaluation methods are either unscalable, relying on manual clinician review, or superficial, relying on proxy metrics (e.g., QA) that fail to capture the semantic correctness of clinical logic. In this work, we introduce a reproducible framework for evaluating reasoning on ECG signals. We propose decomposing reasoning into two distinct components: (i) Perception, the accurate identification of patterns within the raw signal, and (ii) Deduction, the logical application of domain knowledge to those patterns. To evaluate Perception, we employ an agentic framework that generates code to empirically verify the temporal structures described in the reasoning trace. To evaluate Deduction, we use a retrieval-based approach that measures how well the model's logic aligns with a structured database of established clinical criteria. This dual-verification method enables the scalable assessment of "true" reasoning capabilities.
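
As a rough illustration of the Perception idea described above, the sketch below checks one claim from a reasoning trace ("the rhythm is bradycardic") against R-peak locations. It is not the code the paper's agent generates. The function names, the input format (R-peak sample indices, as a wave-delineation tool might provide), and the example values are all assumptions made for illustration. The only clinical fact it relies on is the conventional definition of bradycardia as a heart rate below 60 bpm.

```python
from statistics import mean


def heart_rate_bpm(r_peaks, fs):
    """Mean heart rate in beats per minute.

    r_peaks: R-peak positions as sample indices (hypothetical segmentation output).
    fs: sampling rate in Hz.
    """
    if len(r_peaks) < 2:
        raise ValueError("need at least two R peaks to compute an RR interval")
    # RR intervals in seconds between consecutive R peaks.
    rr_seconds = [(b - a) / fs for a, b in zip(r_peaks, r_peaks[1:])]
    return 60.0 / mean(rr_seconds)


def verify_bradycardia_claim(r_peaks, fs, threshold_bpm=60.0):
    """Return True if the signal supports a 'bradycardia' claim (rate < threshold)."""
    return heart_rate_bpm(r_peaks, fs) < threshold_bpm


# Made-up example: R peaks every 625 samples at 500 Hz, i.e. RR = 1.25 s.
example_r_peaks = [0, 625, 1250, 1875]
example_fs = 500
claim_verified = verify_bradycardia_claim(example_r_peaks, example_fs)
print(f"rate = {heart_rate_bpm(example_r_peaks, example_fs):.1f} bpm, "
      f"bradycardia claim verified: {claim_verified}")
```

Each finding yields one boolean, so a trace with several findings can be scored by the fraction that pass.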

Paper Structure

This paper contains 17 sections, 19 figures, 6 tables.

Figures (19)

  • Figure 1: ECG_ReasonEval Framework. We decompose reasoning evaluation into two independent axes: (i) Perception, verifying whether the reasoning is grounded in the signal, and (ii) Deduction, evaluating whether the logic aligns with clinical consensus. For Perception, we employ a flexible data science agent that dynamically writes and executes Python code to verify specific findings. For Deduction, we query a medical criteria database with the reasoning trace and check whether the retrieved criteria are tagged with the same label as the original signal's ground truth.
  • Figure 2: The Perception Pipeline. First, our data science agent extracts discrete, verifiable findings from the reasoning trace. It then generates executable Python code to empirically verify these claims against the raw ECG signal. This process is augmented by a segmentation tool, which reduces code-generation complexity by providing pre-computed wave delineations. The pipeline outputs a boolean verification status for each finding, indicating whether the reasoning describes the signal accurately.
  • Figure 3: Causes of Perception Failure. With our team of physicians, we manually examined the failed reasoning traces in the validation set to categorize error sources. The results suggest that the Perception pipeline is highly reliable and is capable of auditing errors in human annotations.
  • Figure 4: The Deduction Pipeline. The input reasoning trace is first censored of its final diagnostic label, embedded using the Gemini model, and used to query our database of diagnostic criteria. We retrieve the top-$k$ most semantically similar articles and calculate Precision@$k$ against the ground truth. A high score indicates that the reasoning trace accurately maps to the correct pathology according to established medical standards, mirroring how physicians cross-reference diagnostic criteria.
  • Figure 5: Summary of Baselines with ECG_ReasonEval on Perception and Deduction. All models perform much worse than physicians on both perception and deduction. The TSLMs (OpenTSLM and QoQMed), which were trained with explicit time-series adapters, often do better on perception because they capture time-series features more accurately. Frontier VLMs like Claude Opus 4.5 Plot and Gemini 3.1 Plot are instead able to harness their world knowledge to align with the broader clinical literature, achieving stronger deduction performance. Most notably, Gemini establishes the strongest overall baseline across both perception and deduction metrics.
  • ...and 14 more figures
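
The Precision@$k$ scoring described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D vectors stand in for real text embeddings (the paper embeds with a Gemini model), cosine similarity is an assumed choice of similarity measure, and the function names, labels, and data are hypothetical. Censoring the final diagnostic label from the trace is assumed to have happened before embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def precision_at_k(query_vec, criteria, true_label, k):
    """Fraction of the top-k retrieved criteria tagged with the ground-truth label.

    criteria: list of (embedding, label) pairs, a stand-in for the criteria database.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank criteria by similarity to the (label-censored) reasoning trace embedding.
    ranked = sorted(criteria, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    hits = sum(1 for _, label in ranked[:k] if label == true_label)
    return hits / k


# Toy data: made-up embeddings and labels, for illustration only.
criteria = [
    ([1.0, 0.0], "AFib"),
    ([0.9, 0.2], "AFib"),
    ([0.5, 0.5], "LBBB"),
    ([0.0, 1.0], "LBBB"),
]
query = [1.0, 0.1]  # embedding of a hypothetical reasoning trace

print(precision_at_k(query, criteria, "AFib", k=2))
print(precision_at_k(query, criteria, "AFib", k=3))
```

Dividing by $k$ rather than by the number of items actually retrieved is one possible convention; it penalizes a database that holds fewer than $k$ entries.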