Stabilizing Machine Learning for Reproducible and Explainable Results: A Novel Validation Approach to Subject-Specific Insights
Gideon Vos, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
TL;DR
Reproducibility and explainability in biomedical ML are undermined by sensitivity to random seeds and validation choices. The authors introduce a randomized-trials validation framework that leverages a single general Random Forest model and up to 400 trials per subject to stabilize both model performance and feature importance at subject- and group-levels. Across nine open datasets, the method yields stable feature importance that aligns with established clinical biomarkers while maintaining competitive predictive accuracy, and it reduces seed- and validation-induced variability. The approach thus offers a practical, explainable alternative to costly subject-specific models, with open-source code to enable replication and broader adoption in biomarker discovery and personalized medicine.
Abstract
Machine Learning is transforming medical research by improving diagnostic accuracy and personalizing treatments. General ML models trained on large datasets identify broad patterns across populations, but their effectiveness is often limited by the diversity of human biology. This has led to interest in subject-specific models that use individual data for more precise predictions. However, these models are costly and challenging to develop. To address this, we propose a novel validation approach that uses a general ML model to ensure reproducible performance and robust feature importance analysis at both group and subject-specific levels. We tested a single Random Forest (RF) model on nine datasets varying in domain, sample size, and demographics. Different validation techniques were applied to evaluate accuracy and feature importance consistency. To introduce variability, we performed up to 400 trials per subject, randomly seeding the ML algorithm for each trial. This generated 400 feature sets per subject, from which we identified top subject-specific features. A group-specific feature importance set was then derived from all subject-specific results. We compared our approach to conventional validation methods in terms of performance and feature importance consistency. Our repeated trials approach, with random seed variation, consistently identified key features at the subject level and improved group-level feature importance analysis using a single general model. Subject-specific models address biological variability but are resource-intensive. Our novel validation technique provides consistent feature importance and improved accuracy within a general ML model, offering a practical and explainable alternative for clinical research.
