A Set of Rules for Model Validation
José Camacho
TL;DR
This paper addresses the challenge of devising general, practical guidelines for validating data-driven models, as an alternative to ad hoc schemes. It proposes five general rules covering independent data splits, alignment of the test data with the realistic target population, objective application-specific performance criteria, baseline and null-model checks, and uncertainty-aware model comparisons. The paper illustrates the framework with a PLS-DA validation example and emphasizes the use of double cross-validation (CV), permutation testing, and null data to detect data leakage and assess significance. Together, the rules offer a transparent, adaptable approach to validation that can support reporting, comparability, and potential standardization in applied data science.
Abstract
The validation of a data-driven model is the process of assessing the model's ability to generalize to new, unseen data in the population of interest. This paper proposes a set of general rules for model validation. These rules are designed to help practitioners create reliable validation plans and report their results transparently. While no validation scheme is flawless, these rules can help practitioners ensure their strategy is sufficient for practical use, openly discuss any limitations of their validation strategy, and report clear, comparable performance metrics.
