Empirical Comparison between Cross-Validation and Mutation-Validation in Model Selection

Jinyang Yu, Sami Hamdan, Leonard Sasse, Abigail Morrison, Kaustubh R. Patil

TL;DR

Mutation validation (MV) and cross-validation (CV) select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets.

Abstract

Mutation validation (MV) is a recently proposed approach for model selection, garnering significant interest due to its unique characteristics and potential benefits compared to the widely used cross-validation (CV) method. In this study, we empirically compared MV and $k$-fold CV using benchmark and real-world datasets. By employing Bayesian tests, we compared generalization estimates yielding three posterior probabilities: practical equivalence, CV superiority, and MV superiority. We also evaluated the differences in the capacity of the selected models and computational efficiency. We found that both MV and CV select models with practically equivalent generalization performance across various machine learning algorithms and the majority of benchmark datasets. MV exhibited advantages in terms of selecting simpler models and lower computational costs. However, in some cases MV selected overly simplistic models leading to underfitting and showed instability in hyperparameter selection. These limitations of MV became more evident in the evaluation of a real-world neuroscientific task of predicting sex at birth using brain functional connectivity.
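The three posterior probabilities mentioned in the abstract can be illustrated with a minimal sketch of a Bayesian correlated t-test. This is not the authors' code. It assumes the commonly used form in which the posterior of the mean score difference is a Student-t distribution with a correlation correction of $\rho = 1/k$ for $k$-fold CV. The function names, the sign convention (differences are CV score minus MV score), and the region of practical equivalence `rope=0.01` are illustrative assumptions, and the score differences are made-up numbers.

```python
import math
from statistics import mean, variance


def student_t_cdf(t, df, steps=2000):
    """CDF of a standard Student-t distribution.

    Uses the substitution x = sqrt(df) * tan(theta), which turns the tail
    integral into a bounded integral of cos(theta)**(df - 1), evaluated
    with Simpson's rule (steps must be even).
    """
    upper = math.atan(t / math.sqrt(df))
    h = upper / steps
    total = 1.0 + math.cos(upper) ** (df - 1)  # the two endpoints
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * math.cos(i * h) ** (df - 1)
    integral = total * h / 3.0
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(math.pi))
    return 0.5 + math.exp(log_norm) * integral


def correlated_ttest(diffs, k, rope=0.01):
    """Return (P_PE, P_CV, P_MV) for per-fold score differences.

    diffs: hypothetical per-fold differences, CV score minus MV score.
    k:     number of folds; rho = 1/k is the assumed correlation heuristic.
    rope:  half-width of the region of practical equivalence (assumed).
    """
    n = len(diffs)
    m = mean(diffs)
    var = variance(diffs)  # sample variance with n - 1 in the denominator
    if var == 0.0:
        # Degenerate posterior: all mass sits at the observed mean.
        p_mv = 1.0 if m < -rope else 0.0
        p_cv = 1.0 if m > rope else 0.0
        return 1.0 - p_mv - p_cv, p_cv, p_mv
    rho = 1.0 / k
    scale = math.sqrt((1.0 / n + rho / (1.0 - rho)) * var)
    p_mv = student_t_cdf((-rope - m) / scale, n - 1)       # mass below -rope
    p_cv = 1.0 - student_t_cdf((rope - m) / scale, n - 1)  # mass above +rope
    return 1.0 - p_mv - p_cv, p_cv, p_mv


# Made-up differences from one run of 10-fold CV, for illustration only.
example_diffs = [0.001, -0.001, 0.002, -0.002, 0.0,
                 0.001, -0.001, 0.0, 0.002, -0.002]
p_pe, p_cv, p_mv = correlated_ttest(example_diffs, k=10)
print(f"P_PE={p_pe:.3f}  P_CV={p_cv:.3f}  P_MV={p_mv:.3f}")
```

The three values always sum to one, so a large $P_\mathrm{P.E.}$ means most of the posterior mass lies inside the equivalence region, while a large $P_\mathrm{CV}$ or $P_\mathrm{MV}$ favors the corresponding method.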

Paper Structure

This paper contains 5 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: An overview of four algorithms evaluated on 12 benchmark datasets. Each subfigure consists of four sectors, one for each algorithm, with dataset indices cross-referenced in Table \ref{tab1}. (a) Each sector displays three tracks representing the posterior probabilities $P_\mathrm{P.E.}$, $P_\mathrm{CV}$, and $P_\mathrm{MV}$ for each case. These probabilities are presented as a heat map. (b) The points represent 4000 samples drawn from the posterior probability distribution (default setting, Corani2017). The final posterior probabilities $P_\mathrm{P.E.}$, $P_\mathrm{CV}$, and $P_\mathrm{MV}$ are located in the corners of each sector. (c) For each algorithm, boxplots indicate results obtained from the top 100 hyperparameter values generated by the comparison framework. Note that the dropout rate in the last sector is displayed on an inverted vertical axis, in line with the interpretation of capacity across all four sectors.
  • Figure 2: The three sectors of each subfigure correspond to $k$-fold CV with $k=3$, $k=5$, and $k=10$. (a) This subfigure contains results obtained from the Bayesian correlated t-test across the 12 benchmark datasets. The indices of the datasets are listed in Table \ref{tab1}. In each sector, there are three tracks, representing the three posterior probabilities $P_\mathrm{P.E.}$, $P_\mathrm{CV}$, and $P_\mathrm{MV}$. (b) The results from CV are shown in black and those from MV are shown in red. In each sector, the horizontal axis lists the indices of the benchmark datasets. The left vertical axis shows the total runtime of the procedure, and the right vertical axis shows the equivalent $\mathrm{CO}_2$ emissions.
  • Figure 3: (a) The Bayesian correlated t-test was used to calculate $P_\mathrm{P.E.}$ and $P_\mathrm{CV}$ across subsets of the FC domain with varying numbers of selected best features. The upper sector shows the results obtained from polynomial KRC, while the lower sector displays those obtained from polynomial SVM. The probability curves in purple, yellow, and blue correspond to the datasets ID1000, PIOP1, and PIOP2, respectively. (b) The mean of the 100 best polynomial degrees across subsets of the FC domain for ID1000, PIOP1, and PIOP2, shown for the polynomial KRC and SVM algorithms. Each point in the plot represents the mean of the polynomial degrees, and the error bars indicate the standard deviation. The shaded areas in each sector show the difference between the mean polynomial degrees from CV and MV.