A Set of Rules for Model Validation

José Camacho

TL;DR

This paper addresses the challenge of devising general, practical guidelines for validating data-driven models beyond ad hoc schemes. It proposes five general rules covering independent data splits, realistic test-population alignment, objective application-specific performance criteria, baseline and null-model checks, and uncertainty-aware model comparisons. The author illustrates the framework with a PLS-DA validation example and emphasizes the use of double cross-validation (CV), permutation testing, and null data to detect data leakage and assess significance. The result is a transparent, adaptable approach to validation that can support reporting, comparability, and potential standardization in applied data science.
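As a rough illustration of the permutation testing mentioned above, the sketch below estimates how often a performance score computed on shuffled labels matches or beats the score on the true labels. The names `permutation_p_value`, `score_fn`, and `accuracy` are hypothetical and not taken from the paper. For brevity the example scores a fixed set of predictions; a fuller test would refit and re-validate the model on every permuted label vector.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of matching labels."""
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_p_value(score_fn, y, n_perm=999, seed=0):
    """Permutation p-value for a score where higher is better.

    score_fn: callable mapping a label list to a performance score
              (assumption: in practice this would wrap the whole
              fit-and-validate procedure, not fixed predictions).
    """
    rng = random.Random(seed)
    observed = score_fn(y)
    permuted = list(y)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # break any link between data and labels
        if score_fn(permuted) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (1 + at_least_as_good) / (1 + n_perm)


# Toy usage: predictions that agree perfectly with balanced labels.
y_true = [0] * 10 + [1] * 10
y_pred = list(y_true)
obs, p = permutation_p_value(lambda labels: accuracy(labels, y_pred), y_true)
print(f"observed accuracy = {obs:.2f}, permutation p-value = {p:.3f}")
```

A small p-value only says the score is unlikely under shuffled labels; it does not by itself show that the validation scheme is free of leakage.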

Abstract

The validation of a data-driven model is the process of assessing the model's ability to generalize to new, unseen data in the population of interest. This paper proposes a set of general rules for model validation. These rules are designed to help practitioners create reliable validation plans and report their results transparently. While no validation scheme is flawless, these rules can help practitioners ensure their strategy is sufficient for practical use, openly discuss any limitations of their validation strategy, and report clear, comparable performance metrics.

Paper Structure

This paper contains 10 sections and 6 figures.

Figures (6)

  • Figure 1: Receiver Operating Characteristic (ROC) curves for unbalanced data.
  • Figure 2: Performance results based on the Number of Misclassifications (NMC) for unbalanced data.
  • Figure 3: Performance results based on the weighted number of misclassifications for unbalanced data with a minority class of 1%.
  • Figure 4: Cross-validation curve for Partial Least Squares (PLS) in a simulated dataset where $\mathbf{X} (20 \times 10)$ and $\mathbf{y} (20 \times 1)$ are unrelated.
  • Figure 5: Cross-validation curve and $Q^2$ of double cross-validation for Partial Least Squares (PLS) in a simulated dataset where $\mathbf{X} (20 \times 1000)$ and $\mathbf{y} (20 \times 1)$ are unrelated: variable selection performed before validation (a) and variable selection performed within the inner loop (b).
  • ...and 1 more figure
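The leakage scenario described in the Figure 5 caption (variable selection before validation versus inside the validation loop, on data where X and y are unrelated) can be sketched with a small simulation. This is not the paper's procedure: it swaps in a nearest-centroid classifier, binary labels, leave-one-out accuracy, and a mean-difference filter in place of PLS, double cross-validation, and $Q^2$, and all function names are hypothetical. Only the data shape (20 × 1000, no true relationship) follows the caption.

```python
import random
import statistics


def make_null_data(n=20, p=1000, seed=0):
    """Gaussian noise X (n x p) with balanced binary labels unrelated to X."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [0] * (n // 2) + [1] * (n - n // 2)
    return X, y


def select_top_k(X, y, k):
    """Indices of the k variables with the largest absolute class-mean difference."""
    scores = []
    for j in range(len(X[0])):
        m0 = statistics.fmean(r[j] for r, c in zip(X, y) if c == 0)
        m1 = statistics.fmean(r[j] for r, c in zip(X, y) if c == 1)
        scores.append(abs(m1 - m0))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def nearest_centroid_predict(X_train, y_train, x_new, cols):
    """Assign x_new to the class whose centroid (on cols) is closest."""
    dists = {}
    for c in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        centroid = [statistics.fmean(r[j] for r in rows) for j in cols]
        dists[c] = sum((x_new[j] - m) ** 2 for j, m in zip(cols, centroid))
    return min(dists, key=dists.get)


def loo_accuracy(X, y, k=10, select_inside=True):
    """Leave-one-out accuracy; selection is redone per fold only if select_inside."""
    leaked_cols = None if select_inside else select_top_k(X, y, k)  # sees test rows
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        cols = select_top_k(X_tr, y_tr, k) if select_inside else leaked_cols
        hits += nearest_centroid_predict(X_tr, y_tr, X[i], cols) == y[i]
    return hits / len(X)


X, y = make_null_data()
leaky = loo_accuracy(X, y, select_inside=False)
honest = loo_accuracy(X, y, select_inside=True)
print(f"selection before validation: {leaky:.2f}")
print(f"selection inside the loop:   {honest:.2f}")
```

Because the data are pure noise, any apparent skill in the first number comes from the held-out samples having influenced which variables were kept; moving the selection inside the loop removes that path.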