Position: Evaluation of ECG Representations Must Be Fixed

Zachary Berger; Daniel Prakah-Asante; John Guttag; Collin M. Stultz

TL;DR

We argue that downstream evaluation of ECG representations should expand to include structural heart disease and patient-level forecasting, alongside other evolving ECG-related endpoints, as relevant clinical targets, so that progress is reliable and aligned with clinically meaningful objectives.

Abstract

This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.

Paper Structure (33 sections, 4 equations, 1 figure, 28 tables)

Introduction
Related Work
Extending beyond current benchmarks to more holistic clinical applications
Toward evaluation best-practices that reliably stratify representation quality
Empirical Study
Pre-training Configuration
Downstream Tasks
Evaluation Protocol
Experimental Results
Evaluation on PTB-XL, CPSC2018, CSN
Sensitivity of Macro-Averaged Metrics to Labels with Few Examples
Evaluation on EchoNext
Hemodynamic Inference
Patient Forecasting
Alternative Views
...and 18 more sections

Figures (1)

Figure 1: Overview of the evaluation pipeline for 12-lead ECG representations. (a) Current practice focuses on arrhythmia/waveform tasks and macro-AUROC point estimates, which can produce misleading method rankings. (b) We propose a broader set of clinically relevant tasks and evaluation best practices that more reliably assess methods. We find that no method consistently prevails and that, for many tasks, many methods overlap with the baseline of a randomly initialized encoder.
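
To make the contrast between a macro-AUROC point estimate and an interval-based comparison concrete, the following is a minimal sketch in plain Python. It is not the paper's evaluation code: the percentile bootstrap, the recording-level resampling, the choice to skip labels whose AUROC is undefined in a resample, and the function names `auroc`, `macro_auroc`, and `bootstrap_ci` are all assumptions made for illustration.

```python
import random
from statistics import mean


def auroc(labels, scores):
    """AUROC for one binary label via pairwise comparison; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true, y_score):
    """Unweighted mean of per-label AUROCs, skipping undefined labels (assumption)."""
    per_label = []
    for j in range(len(y_true[0])):
        a = auroc([row[j] for row in y_true], [row[j] for row in y_score])
        if a is not None:
            per_label.append(a)
    return mean(per_label) if per_label else None


def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro-AUROC, resampling recordings (assumption)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        m = macro_auroc([y_true[i] for i in idx], [y_score[i] for i in idx])
        if m is not None:
            stats.append(m)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi
```

Under these assumptions, two methods whose intervals overlap (for example, a pre-trained encoder and a randomly initialized one) would not be ranked on the point estimate alone. Labels with very few positive examples tend to widen the interval, since each per-label AUROC contributes equally to the macro average regardless of its support.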