Is K-fold cross validation the best model selection method for Machine Learning?

Juan M Gorriz, R. Martin Clemente, F Segovia, J Ramirez, A Ortiz, J. Suckling

TL;DR

The performance with simulated and neuroimaging datasets suggests that K-fold CUBV is a robust criterion for detecting effects and validating accuracy values obtained from machine learning and classical CV schemes, while avoiding excess false positives.

Abstract

As a technique that can compactly represent complex patterns, machine learning has significant potential for predictive inference. K-fold cross-validation (CV) is the most common approach to ascertaining the likelihood that a machine learning outcome is generated by chance, and it frequently outperforms conventional hypothesis testing. This improvement uses measures directly obtained from machine learning classifications, such as accuracy, that do not have a parametric description. To approach a frequentist analysis within machine learning pipelines, a permutation test or simple statistics from data partitions (i.e., folds) can be added to estimate confidence intervals. Unfortunately, neither parametric nor non-parametric tests solve the inherent problems of partitioning small sample-size datasets and learning from heterogeneous data sources. The fact that machine learning strongly depends on the learning parameters and the distribution of data across folds recapitulates familiar difficulties around excess false positives and replication. A novel statistical test based on K-fold CV and the Upper Bound of the actual risk (K-fold CUBV) is proposed, where uncertain predictions of machine learning with CV are bounded by the worst case through the evaluation of concentration inequalities. Probably Approximately Correct-Bayesian upper bounds for linear classifiers in combination with K-fold CV are derived and used to estimate the actual risk. The performance with simulated and neuroimaging datasets suggests that K-fold CUBV is a robust criterion for detecting effects and validating accuracy values obtained from machine learning and classical CV schemes, while avoiding excess false positives.
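
The baseline that the abstract describes, K-fold CV accuracy combined with a permutation test, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the nearest-class-mean rule is an assumed stand-in for "a linear classifier", the function names (`draw_sample`, `kfold_accuracy`, `permutation_test`) are hypothetical, and the effect size $d$ is interpreted here as a shift of $d$ standard deviations along the first feature axis. It does not implement the proposed K-fold CUBV bound.

```python
import numpy as np

def draw_sample(n_per_class, d, dim, rng):
    """Two unit-variance Gaussian classes whose means differ by d along axis 0."""
    X0 = rng.standard_normal((n_per_class, dim))
    X1 = rng.standard_normal((n_per_class, dim))
    X1[:, 0] += d
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

def kfold_accuracy(X, y, k, rng):
    """Mean held-out accuracy of a nearest-class-mean (linear) classifier."""
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        mu0 = X[train][y[train] == 0].mean(axis=0)
        mu1 = X[train][y[train] == 1].mean(axis=0)
        # Closer to mu1 than to mu0 -> predict class 1 (a linear decision rule).
        pred = (np.linalg.norm(X[test] - mu1, axis=1)
                < np.linalg.norm(X[test] - mu0, axis=1)).astype(int)
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))

def permutation_test(X, y, k, n_perm, rng):
    """Compare the observed CV accuracy with accuracies under shuffled labels."""
    observed = kfold_accuracy(X, y, k, rng)
    null = np.array([kfold_accuracy(X, rng.permutation(y), k, rng)
                     for _ in range(n_perm)])
    # The +1 terms count the observed statistic as one of the permutations.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, null, float(p_value)

rng = np.random.default_rng(0)
X, y = draw_sample(n_per_class=100, d=2.0, dim=2, rng=rng)
observed, null, p_value = permutation_test(X, y, k=10, n_perm=200, rng=rng)
print(f"CV accuracy = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

With small samples or heterogeneous data, both the observed accuracy and the null distribution depend on how the folds happen to be drawn, which is the weakness the abstract attributes to this kind of test.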

Paper Structure (44 sections, 5 theorems, 33 equations, 24 figures)

Key Result

Theorem 1

For any constant $\lambda>1/2$ and a class of classifiers $\mathcal{F}$ that are selected according to the distribution $Q$, with probability at least $1-\eta$ over the draw of the sample, the following CI holds for all distributions $Q$: where $D_{KL}(Q\|Q_u)\equiv\mathbb{E}_{f\sim Q}[\ln \frac{Q(f)}{Q_u(f)}]$ is the Kullback-Leibler divergence from $Q$ to the uniform distributi

Figures (24)

  • Figure 1: Left column: null distribution of accuracy values using K-fold CV (in green) obtained by sampling the pdf (middle) and by permuting the fold distribution (bottom). The proposed K-fold CUBV method to control FP in this null experiment is shown in blue. In this example the dimension is $n=2$ and Cohen's $d=0$. Middle column: example of a classification problem using linear decision functions and samples drawn from two Gaussian pdfs with Cohen's $d=2$, similar to the problem described in Section \ref{sec:null}. The average accuracy and its standard deviation versus sample size are displayed in green for the standard K-fold CV, $K=10$. The theoretical error achieved by linear classifiers on the whole dataset ($2\times 10^4$ samples) in this problem is displayed by the black line. Right column: example of a classification problem with a single sample generated following the procedure described in Gorriz19 and an $F=100$-fold accuracy distribution for $K=10$.
  • Figure 2: Performance of K-fold CV in common experimental designs. Typical large biobanks include data across modalities such as neuroimaging, biosensor, genetic, clinical, and omics data. A set of synthetic and real MRI samples, obtained from $N_c$ sources and expressed as $n$-dimensional features, are analysed. The theoretical error achieved by a (linear) classifier can be assessed by resubstitution on the infinite population (pdf). Then, the K-fold CV error is estimated by (theoretically) sampling this pdf ($M$ times) or by (realistically) permuting the learning folds using a single realization of the sample ($F$ times). The proposed CUBV test rejects the null hypothesis in the blue-shaded area.
  • Figure 3: Examples of performance, FP rates and MC performance evaluation across independent (multi-sample) experiments.
  • Figure 4: Examples of performance, FP rates and MC performance evaluation in single sample experiments.
  • Figure 5: The accuracy values (average and standard deviation) obtained in K-fold CV versus complexity ($N_c$) and sample size ($N$) with $M=1000$ and a large effect size, in an $n=1$ (top) and $n=6$ (bottom) binary classification task.
  • ...and 19 more figures
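
The two ways of estimating K-fold CV variability mentioned in the caption of Figure 2, drawing $M$ fresh samples from the pdf versus re-partitioning the folds $F$ times on a single sample, can be contrasted with a small simulation. The sketch below rests on several assumptions that are not taken from the paper: two unit-variance Gaussian classes separated by $d=2$ along the first axis, a nearest-class-mean rule as the linear classifier, a single source ($N_c=1$), and hypothetical helper names. Under those assumptions the best achievable accuracy is $\Phi(d/2)$, with $\Phi$ the standard normal CDF, and it is used here as the reference value.

```python
import math
import numpy as np

def draw_sample(n_per_class, d, dim, rng):
    """Two unit-variance Gaussian classes whose means differ by d along axis 0."""
    X0 = rng.standard_normal((n_per_class, dim))
    X1 = rng.standard_normal((n_per_class, dim))
    X1[:, 0] += d
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

def kfold_accuracy(X, y, k, rng):
    """Mean held-out accuracy of a nearest-class-mean (linear) classifier."""
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        mu0 = X[train][y[train] == 0].mean(axis=0)
        mu1 = X[train][y[train] == 1].mean(axis=0)
        # Closer to mu1 than to mu0 -> predict class 1 (a linear decision rule).
        pred = (np.linalg.norm(X[test] - mu1, axis=1)
                < np.linalg.norm(X[test] - mu0, axis=1)).astype(int)
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))

rng = np.random.default_rng(1)
n_per_class, d, dim, K = 50, 2.0, 2, 10
M, F = 200, 100

# (a) "Theoretical" variability: M independent samples drawn from the pdf.
acc_resampled = np.array([
    kfold_accuracy(*draw_sample(n_per_class, d, dim, rng), K, rng)
    for _ in range(M)
])

# (b) "Realistic" variability: one fixed sample, F random fold assignments.
X, y = draw_sample(n_per_class, d, dim, rng)
acc_repartitioned = np.array([kfold_accuracy(X, y, K, rng) for _ in range(F)])

# Reference accuracy for this toy problem: Phi(d / 2).
bayes_acc = 0.5 * (1.0 + math.erf((d / 2.0) / math.sqrt(2.0)))

print(f"reference accuracy     : {bayes_acc:.3f}")
print(f"M resampled datasets   : {acc_resampled.mean():.3f} +/- {acc_resampled.std():.3f}")
print(f"F fold re-partitionings: {acc_repartitioned.mean():.3f} +/- {acc_repartitioned.std():.3f}")
```

Approach (a) is only available in simulation, since real studies have a single sample. The spread obtained in (b) reflects fold assignment only and is centred on that one sample's accuracy, so it need not match the spread seen in (a).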

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Definition 3
  • Definition 4
  • Definition 5
  • Proposition 1
  • Lemma 1
  • Proposition 2
  • Proposition 3