Which Leakage Types Matter?

Simon Roth

Abstract

Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on the full data) is negligible: all nine conditions produce $|\Delta\text{AUC}| \leq 0.005$. Class II (selection: peeking, seed cherry-picking) is substantial: roughly 90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: $d_z = 0.37$ (Naive Bayes) to $1.11$ (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
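The Class I pattern from the abstract can be illustrated with a minimal sketch. The synthetic task and the nearest-centroid scorer below are illustrative assumptions, not the paper's benchmark suite or models: the "leaky" pipeline computes standardization statistics on all rows (test fold included), while the clean pipeline uses the training fold only. Consistent with the abstract's Class I finding, the gap is typically negligible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary task (illustrative, not the paper's benchmark suite).
n, d = 400, 10
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, d)) + 0.5 * y[:, None]

train = np.arange(n) < 300
X_tr, X_te, y_tr, y_te = X[train], X[~train], y[train], y[~train]

def rank_auc(scores, labels):
    """AUC via the Mann-Whitney rank statistic."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int((labels == 1).sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def centroid_scores(Xtr, ytr, Xte):
    """Nearest-centroid score: closer to the positive centroid => higher."""
    mu1, mu0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
    return np.linalg.norm(Xte - mu0, axis=1) - np.linalg.norm(Xte - mu1, axis=1)

# Leaky (Class I): standardization statistics computed on ALL rows.
mu, sd = X.mean(0), X.std(0)
auc_leaky = rank_auc(centroid_scores((X_tr - mu) / sd, y_tr, (X_te - mu) / sd), y_te)

# Clean: standardization statistics from the training fold only.
mu, sd = X_tr.mean(0), X_tr.std(0)
auc_clean = rank_auc(centroid_scores((X_tr - mu) / sd, y_tr, (X_te - mu) / sd), y_te)

print(f"leaky={auc_leaky:.4f}  clean={auc_clean:.4f}  delta={auc_leaky - auc_clean:+.4f}")
```

Because per-feature means and standard deviations estimated from 400 rows barely differ from those estimated from the 300 training rows, the two pipelines produce nearly identical rankings, which is exactly why estimation leakage is hard to exploit.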

Paper Structure

This paper contains 36 sections, 1 equation, and 6 figures.

Figures (6)

  • Figure 1: Distribution of $\Delta$AUC across leakage experiments, grouped by leakage class. Class I (estimation, teal) centers on zero. Class II (selection, red) shows persistent positive inflation. Class III (memorization, blue) leakage varies with model capacity and duplication rate.
  • Figure 2: Peeking inflation distribution across 2,047 datasets at $k = 10$. The distribution is right-skewed with 92% positive prevalence.
  • Figure 3: Seed inflation dose-response: RF inflation grows logarithmically with K seeds while LR remains deterministic at zero.
  • Figure 4: Capacity amplification. Each line connects six algorithms ordered by capacity (NB → LR → XGB → RF → KNN → DT) at a fixed duplication rate. Higher duplication shifts all algorithms upward (intercept), but the lines also fan out: the gap between constrained (LR) and flexible (DT) models widens from $\Delta$AUC = 0.011 at 5% to 0.064 at 30%, revealing a capacity $\times$ duplication interaction.
  • Figure 5: N-scaling separates the leakage classes. (a,b) Class II extends to $n = 10{,}000$: peeking retains a diversity residual; seed decays to near-zero (pure noise exploitation). (c,d) Class I and III shown at $n = 50$--$2{,}000$ only: normalization is already zero by $n = 200$; oversampling declines steeply but extension data is excluded due to survivorship bias ($N$ drops from 149 to 59 in the imbalanced subset). Thin lines = individual datasets; thick line = mean. Shaded band = interquartile range.
  • ...and 1 more figure
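The seed cherry-picking dose-response in Figure 3 can be reproduced with a toy simulation. The numbers below (an honest AUC of 0.75 and a seed-to-seed evaluation noise of 0.02) are illustrative assumptions, not values from the paper: reporting the best score among K random seeds inflates the expected result purely by exploiting evaluation noise, and the inflation grows roughly logarithmically in K.

```python
import random
import statistics

random.seed(0)

TRUE_AUC = 0.75   # assumed "honest" score of the model
NOISE_SD = 0.02   # assumed seed-to-seed evaluation noise
TRIALS = 5000     # number of simulated datasets

def best_of_k_inflation(k: int) -> float:
    """Mean inflation from reporting the best score among k random seeds."""
    best_scores = (
        max(random.gauss(TRUE_AUC, NOISE_SD) for _ in range(k))
        for _ in range(TRIALS)
    )
    return statistics.mean(best_scores) - TRUE_AUC

for k in (1, 2, 5, 10, 20):
    print(f"K={k:2d}  inflation={best_of_k_inflation(k):+.4f}")
```

Under Gaussian noise the inflation equals NOISE_SD times the expected maximum of K standard normals, which grows like $\sqrt{2 \ln K}$, matching the logarithmic dose-response described for RF; a model with no seed-dependent randomness (like the paper's deterministic LR) effectively has NOISE_SD = 0 and stays at zero.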