Table of Contents
Fetching ...

Position: Evaluation of ECG Representations Must Be Fixed

Zachary Berger, Daniel Prakah-Asante, John Guttag, Collin M. Stultz

TL;DR

It is argued that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets, to ensure progress is reliable and aligned with clinically meaningful objectives.

Abstract

This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.

Position: Evaluation of ECG Representations Must Be Fixed

TL;DR

It is argued that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets, to ensure progress is reliable and aligned with clinically meaningful objectives.

Abstract

This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
Paper Structure (33 sections, 4 equations, 1 figure, 28 tables)

This paper contains 33 sections, 4 equations, 1 figure, 28 tables.

Figures (1)

  • Figure 1: Overview of the evaluation pipeline for 12-lead ECG representations. (a) Current practice focuses on arrhythmia/waveform tasks and macro-AUROC point estimates, which can produce misleading method rankings. (b) We propose a broader set of clinically relevant tasks and evaluation best-practices that more reliably assess methods. We find that no method consistently prevails and for many tasks, many methods overlap with the baseline of a randomly initialized encoder.