How You Split Matters: Data Leakage and Subject Characteristics Studies in Longitudinal Brain MRI Analysis
Dewinda Julianensi Rumala
TL;DR
This study interrogates data leakage in longitudinal brain MRI analysis with 3D CNNs, comparing subject-wise, record-wise, and late-wise data splits. Using ADNI-derived T1/T2 MRIs and Grad-CAM, it demonstrates that record-wise and late-wise splits yield inflated cross-validation performance due to identity confounding, whereas subject-wise splitting with hold-out evaluation provides more reliable generalization. The findings underscore the importance of early, subject-wise data partitioning and external validation to ensure robustness in longitudinal MRI classification tasks such as Alzheimer's disease analysis. The work offers practical guidance for evaluating deep learning models in medical imaging and highlights the need for larger, balanced datasets to mitigate under-fitting and demographic biases.
Abstract
Deep learning models have revolutionized the field of medical image analysis, offering significant promise for improved diagnostics and patient care. However, their reported performance can be misleadingly optimistic due to a hidden pitfall called 'data leakage'. In this study, we investigate data leakage in 3D medical imaging, specifically using 3D Convolutional Neural Networks (CNNs) for brain MRI analysis. While 3D CNNs appear less prone to leakage than their 2D counterparts, improper data splitting during cross-validation (CV) can still cause leakage, especially with longitudinal imaging data containing repeated scans from the same subject. We explore the impact of different data splitting strategies on model performance for longitudinal brain MRI analysis and identify potential data leakage concerns. Grad-CAM visualization helps reveal shortcuts in CNN models caused by identity confounding, where the model learns to identify subjects along with diagnostic features. Our findings, consistent with prior research, underscore the importance of subject-wise splitting and of further evaluating models on hold-out data from different subjects to ensure the integrity and reliability of deep learning models in medical image analysis.
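The difference between record-wise and subject-wise splitting described above can be illustrated with a minimal sketch. This is not the study's code: the function names, the toy subject IDs, and the fold count are all hypothetical, and only the Python standard library is used. The idea is that record-wise splitting shuffles individual scans, so repeated scans of one subject can land in different folds, whereas subject-wise splitting assigns whole subjects to folds.

```python
import random
from collections import defaultdict


def record_wise_folds(subject_ids, n_folds, seed=0):
    """Shuffle individual scans into folds, ignoring subject identity."""
    indices = list(range(len(subject_ids)))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[k::n_folds]) for k in range(n_folds)]


def subject_wise_folds(subject_ids, n_folds, seed=0):
    """Assign whole subjects to folds so all scans of a subject stay together."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    folds = [[] for _ in range(n_folds)]
    for k, sid in enumerate(subjects):
        folds[k % n_folds].extend(by_subject[sid])
    return [sorted(fold) for fold in folds]


def leaked_subjects(subject_ids, folds):
    """Return the subjects whose scans appear in more than one fold."""
    folds_seen = defaultdict(set)
    for k, fold in enumerate(folds):
        for idx in fold:
            folds_seen[subject_ids[idx]].add(k)
    return {sid for sid, ks in folds_seen.items() if len(ks) > 1}


# Hypothetical longitudinal dataset: 6 subjects with 3 visits each.
scans = [f"S{i:02d}" for i in range(6) for _ in range(3)]

sw_folds = subject_wise_folds(scans, n_folds=3)
rw_folds = record_wise_folds(scans, n_folds=3)

# Subject-wise folds never share a subject by construction.
print("subject-wise leaked subjects:", leaked_subjects(scans, sw_folds))
# Record-wise folds typically do; the exact count depends on the shuffle.
print("record-wise leaked subjects:", len(leaked_subjects(scans, rw_folds)))
```

In practice, a library utility that splits by group, such as scikit-learn's `GroupKFold` with subject IDs passed as `groups`, serves the same purpose. A check like `leaked_subjects` is a cheap sanity test to run on any split before training.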
