
Empirical investigation of multi-source cross-validation in clinical ECG classification

Tuija Leinonen, David Wong, Antti Vasankari, Ali Wahab, Ramesh Nadarajah, Matti Kaisti, Antti Airola

TL;DR

K-fold cross-validation, both on single-source and multi-source data, systematically overestimates prediction performance when the end goal is to generalize to new sources, while leave-source-out cross-validation provides more reliable performance estimates, with close to zero bias but larger variability.

Abstract

Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to the accuracy obtained from deploying models to sources not represented in the dataset, such as a new hospital. The increasing availability of multi-source medical datasets provides new opportunities for obtaining more comprehensive and realistic evaluations of expected accuracy through source-level cross-validation designs. In this study, we present a systematic empirical evaluation of standard K-fold cross-validation and leave-source-out cross-validation methods in a multi-source setting. We consider the task of electrocardiogram-based cardiovascular disease classification, combining and harmonizing the openly available PhysioNet CinC Challenge 2021 and the Shandong Provincial Hospital datasets for our study. Our results show that K-fold cross-validation, both on single-source and multi-source data, systematically overestimates prediction performance when the end goal is to generalize to new sources. Leave-source-out cross-validation provides more reliable performance estimates, with close to zero bias but larger variability. The evaluation highlights the dangers of obtaining misleading cross-validation results on medical data and demonstrates how these issues can be mitigated when multi-source data is available.

Paper Structure (27 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Number of diagnoses per patient per data source. The last bar summarizes the range from 5 to 12.
  • Figure 2: Illustration of the difference between the K-fold CV and the LSO CV. The training data is gathered from $n$ distinct sources. In the K-fold CV, a training set is first pooled from all the sources and then randomly divided into K disjoint folds, whereas in the LSO CV, each data source forms its own fold. Each split corresponds to one round of CV, where a model is trained on the union of the training folds and used to make predictions on the validation fold. The final performance estimate is the average of the validation fold performances. The final model is trained on the full training set consisting of all the data from the $n$ sources. A test source refers to out-of-sample data that is not part of the $n$ sources used for training and cross-validation, but to which the model would later be applied.
  • Figure 3: AUC ($\pm$ SD) scores for the 5-fold CV and individual test sets per training set.
  • Figure 4: AUC ($\pm$ SD) scores for the 4-fold CV, the LSO CV and the test results per test set.
  • Figure 5: Confusion matrices for each input set in the multiclass classification task where the output is the data source of ECGs.
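
The two splitting schemes described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `kfold_splits` and `leave_source_out_splits`, the toy source labels, and the fold-assignment details are all assumptions made for the example.

```python
import random
from collections import defaultdict


def kfold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold CV on pooled data.

    All samples are pooled regardless of source, shuffled once,
    and dealt into k disjoint folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, val


def leave_source_out_splits(sources):
    """Yield (held_out_source, train_idx, val_idx), one round per source.

    Each distinct source label forms its own validation fold, and the
    model would be trained on all remaining sources.
    """
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)
    for held_out in sorted(by_source):
        val = list(by_source[held_out])
        train = [
            idx
            for src in sorted(by_source)
            if src != held_out
            for idx in by_source[src]
        ]
        yield held_out, train, val


# Hypothetical toy data: 12 recordings from three sources (labels are made up).
sources = ["A"] * 4 + ["B"] * 3 + ["C"] * 5

kfold = list(kfold_splits(len(sources), k=3))
lso = list(leave_source_out_splits(sources))

for train, val in kfold:
    # A K-fold validation fold typically mixes sources seen in training.
    print("K-fold val sources:", sorted({sources[i] for i in val}))

for held_out, train, val in lso:
    # An LSO validation fold contains exactly one source, unseen in training.
    print("LSO held-out source:", held_out, "| val size:", len(val))
```

In the K-fold rounds, the validation fold usually shares sources with the training folds, which is the setting the paper reports as overestimating performance on new sources. In the LSO rounds, the held-out source never appears in training, so each round mimics deployment to an unseen source.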