Impact of Leakage on Data Harmonization in Machine Learning Pipelines in Class Imbalance Across Sites

Nicolás Nieto, Simon B. Eickhoff, Christian Jung, Martin Reuter, Kersten Diers, Malte Kelm, Artur Lichtenberg, Federico Raimondo, Kaustubh R. Patil

TL;DR

The paper addresses data leakage risks in multisite data harmonization for ML pipelines, revealing that ComBat-based methods can erase biologically relevant variance when site-target dependencies exist. It introduces PrettYharmonize, a leakage-free approach that combines pretended target harmonization with stacking to produce final predictions without needing true test labels. Through controlled MAREoS benchmarks and real MRI/ICU data, the method achieves competitive performance while avoiding leakage, particularly under site-target dependence, and shows limited or no advantage under site-target independence. The work emphasizes careful evaluation of harmonization in ML workflows and provides open-source tools to promote reproducible, leakage-free analyses in medical imaging and clinical prediction tasks.

Abstract

Machine learning (ML) models benefit from large datasets. Because collecting data in biomedical domains is costly and challenging, combining datasets has become common practice. However, datasets obtained under different conditions can exhibit undesired site-specific variability. Data harmonization methods aim to remove site-specific variance while retaining biologically relevant information. This study evaluates the effectiveness of widely used ComBat-based methods for harmonizing data in scenarios where the class balance differs across sites. We find that these methods struggle with data leakage issues. To overcome this problem, we propose a novel approach, PrettYharmonize, designed to harmonize data by pretending the target labels. We validate our approach using controlled datasets designed to benchmark the utility of harmonization. Finally, using real-world MRI and clinical data, we compare leakage-prone methods with PrettYharmonize and show that it achieves comparable performance while avoiding data leakage, particularly in site-target-dependence scenarios.
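The tension described in the abstract can be illustrated with a small synthetic sketch. This is not ComBat and not PrettYharmonize: it uses plain per-site mean centering and an ordinary least-squares adjustment as simplified stand-ins, and every number in it (class proportions, effect size of 1.0, site offset of 0.5) is an assumption chosen for illustration. When class balance differs across sites, a label-blind site correction also removes part of the class signal, whereas a label-aware correction keeps it but needs the target labels, which is exactly what is unavailable for test data.

```python
import numpy as np

def center_per_site(x, site):
    """Label-blind adjustment: subtract each site's mean from its samples."""
    out = x.astype(float).copy()
    for s in np.unique(site):
        mask = site == s
        out[mask] -= out[mask].mean()
    return out

def remove_site_keep_target(x, site, y):
    """Label-aware adjustment: fit x ~ intercept + site + target by least
    squares and subtract only the estimated site term. Requires labels."""
    design = np.column_stack([np.ones_like(x), site, y]).astype(float)
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - coef[1] * site

def class_gap(x, y):
    """Difference between the pooled class-1 and class-0 feature means."""
    return x[y == 1].mean() - x[y == 0].mean()

# Assumed toy setting: two sites with opposite class balance.
rng = np.random.default_rng(0)
n = 1000
site = np.repeat([0, 1], n)
y = np.concatenate([rng.random(n) < 0.1,   # site 0: about 10% class 1
                    rng.random(n) < 0.9])  # site 1: about 90% class 1
y = y.astype(int)

# One feature: true class effect 1.0, site offset 0.5, unit Gaussian noise.
x = 1.0 * y + 0.5 * site + rng.normal(0.0, 1.0, 2 * n)

gap_raw = class_gap(x, y)                                     # inflated by the site offset
gap_blind = class_gap(center_per_site(x, site), y)            # class signal partly erased
gap_aware = class_gap(remove_site_keep_target(x, site, y), y) # close to the true effect

print(f"raw: {gap_raw:.2f}  label-blind: {gap_blind:.2f}  label-aware: {gap_aware:.2f}")
```

Under these assumptions, the pooled class gap after label-blind centering should fall well below the simulated effect of 1.0, while the label-aware adjustment should recover a gap near 1.0. The label-aware variant only works here because `y` is passed in for every sample; applying it to unseen data would require test labels, which is the leakage the paper is concerned with.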

Paper Structure

This paper contains 17 sections, 2 figures, and 13 tables.

Figures (2)

  • Figure 1: Age regression. (a) Site-disaggregated performance in site-target dependence scenarios. (b) Site-disaggregated performance in site-target independence scenarios.
  • Figure 2: PrettYharmonize training workflow, shown for a binary classification problem.