Misconduct in Post-Selections and Deep Learning

Juyang Weng

TL;DR

This paper reveals that using cross-validation for data splits is insufficient to exonerate Post-Selections in machine learning. In general, Post-Selections of statistical learners based on their errors on the validation set are statistically invalid.

Abstract

This is a theoretical paper on "Deep Learning" misconduct in particular and Post-Selection in general. As far as the author knows, the first peer-reviewed papers on Deep Learning misconduct are [32], [37], [36]. Regardless of learning mode (e.g., supervised, reinforcement, adversarial, or evolutionary), almost all machine learning methods (except for a few that train a sole system) are rooted in the same misconduct, cheating and hiding: (1) cheating in the absence of a test and (2) hiding bad-looking data. It was reasoned in [32], [37], [36] that authors must report at least the average error of all trained networks, good and bad, on the validation set (called general cross-validation in this paper). Better still, they should also report five percentage positions of ranked errors. The new analysis here shows that the hidden culprit is Post-Selection. This is also true for Post-Selection on hand-tuned or searched hyperparameters, because they are random, depending on random observation data. Does cross-validation on data splits rescue Post-Selections from Misconducts (1) and (2)? The new result here says: No. Specifically, this paper reveals that using cross-validation for data splits is insufficient to exonerate Post-Selections in machine learning. In general, Post-Selections of statistical learners based on their errors on the validation set are statistically invalid.
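
The reporting rule described in the abstract (the average error of all trained networks, plus five positions of the ranked errors) can be sketched as follows. This is only an illustrative sketch, not the author's code: the function name `report_all_networks`, the sample error values, and the choice of the five positions (minimum, 25%, median, 75%, maximum) are assumptions, since the abstract does not say which five positions are meant.

```python
import numpy as np


def report_all_networks(val_errors, positions=(0, 25, 50, 75, 100)):
    """Summarize the validation errors of ALL trained networks.

    `positions` are percentile ranks of the sorted errors. The default
    (min, 25%, median, 75%, max) is an assumed choice for illustration.
    """
    errors = np.asarray(val_errors, dtype=float)
    return {
        "n_networks": int(errors.size),
        # Average over every trained network, good and bad.
        "mean_error": float(errors.mean()),
        # Ranked-error positions, so bad-looking networks are not hidden.
        "ranked_positions": {
            p: float(np.percentile(errors, p)) for p in positions
        },
        # The Post-Selected ("luckiest") error, shown only for contrast.
        "luckiest_error": float(errors.min()),
    }


# Hypothetical validation errors of five trained networks.
val_errors = [0.12, 0.30, 0.18, 0.25, 0.15]
summary = report_all_networks(val_errors)
print(summary)
```

Reporting only `luckiest_error` would hide the other four networks. The mean and the ranked positions expose the whole population of trained networks.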

Paper Structure

This paper contains 19 sections, 5 theorems, 15 equations, and 2 figures.

Key Result

Theorem 1

The minimum MSE estimate of a random variable $e$ from $n$ random samples, $e(\theta_i)$, $i = 1, 2, \ldots, n$, is its probability mean (Eq. EQ:e*). Thus, general cross-validation should use Eq. EQ:e*n if we assume each sample is equally likely.
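
The equations that Theorem 1 refers to are not reproduced on this page. The standard argument behind the statement can be sketched as follows; the symbols $c$, $e^*$, and $e^*_n$ are assumed notation inferred from the equation labels, not necessarily the paper's own. For any constant estimate $c$ of $e$,

$$
E\big[(e - c)^2\big] = E\big[(e - E[e])^2\big] + \big(E[e] - c\big)^2 \;\ge\; E\big[(e - E[e])^2\big],
$$

with equality if and only if $c = E[e]$, so the minimum MSE estimate is the mean, $e^* = E[e]$. If each of the $n$ samples is assumed equally likely (probability $1/n$), the mean becomes the plain average

$$
e^*_n = \frac{1}{n} \sum_{i=1}^{n} e(\theta_i).
$$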

Figures (2)

  • Figure 1: A 1D-terrain illustration for the fit error (dashed curve) from the fit data set, the post-selection validation error (solid thin curve) from the validation data set, and the unknown test error (thick green curve) from a future test set. The green and blue NN-balls end at the max-post pit and the luckiest-post pit, respectively. The red NN-ball will miss the lowest post error. Only if $n$ is large ($n=3$ here), can the validation error from all $n$ network weight samples (i.e., cross-validation) better predict the expected error on the unknown test set. The validation set and the test set have similar distributions here but are disjoint. Figure modified from WengCLAIEE22.
  • Figure 2: Nest cross-validation for post-selections. Operator abbreviations: F: Fit. P: Post-Selection. V: Validation. A: Average. The nest cross-validation has two stages: the early cross-validation (blue) before Post-Selection based on $V$, and the later cross-validation (red) after Post-Selection. A blue-arrowed A is due to the early cross-validation using blue data folds as validation. The red-arrowed A is due to the later cross-validation using the red data folds as validation. Cross-validation results in a system that consists of all networks (or a sufficient number of representatives) that participate in the average performance. The post-selection P here selects the luckiest single network (or a few, $m \ll n$) according to the error on the blue validation set.
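
The effect that the figure captions describe, where the luckiest network on the validation set does not keep its luck on a disjoint test set, can be illustrated with a toy Monte Carlo simulation. This is a simplified sketch and not the paper's proof. It assumes that all $n$ networks share the same true error rate and that validation and test errors are independent binomial draws. The function name `simulate_post_selection` and all parameter values are hypothetical.

```python
import numpy as np


def simulate_post_selection(n_networks=20, n_val=200, n_test=200,
                            true_error=0.3, trials=2000, seed=0):
    """Toy model: every network has the same true error rate.

    Validation and test error rates are independent binomial estimates,
    so any gap between networks on the validation set is pure luck.
    """
    rng = np.random.default_rng(seed)
    shape = (trials, n_networks)
    val = rng.binomial(n_val, true_error, size=shape) / n_val
    test = rng.binomial(n_test, true_error, size=shape) / n_test

    # Post-Selection: pick the network with the lowest validation error.
    luckiest = val.argmin(axis=1)
    rows = np.arange(trials)
    return {
        # Validation error of the Post-Selected network (what gets reported).
        "luckiest_val": float(val[rows, luckiest].mean()),
        # Error of the same network on the disjoint test set.
        "luckiest_test": float(test[rows, luckiest].mean()),
        # Average validation error over all networks.
        "average_val": float(val.mean()),
    }


result = simulate_post_selection()
print(result)
```

Under these assumptions the Post-Selected validation error is optimistically biased, while the average over all networks stays close to the true error rate. This is the kind of gap that the figures are meant to convey.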

Theorems & Definitions (5)

  • Theorem 1: general cross-validation
  • Theorem 2: Lost Luck
  • Theorem 3: Must report all trained networks
  • Theorem 4: Input Cross-Validated Post-Selection
  • Theorem 5: Nest-Cross-Validated Post-Selection