A Link between Coding Theory and Cross-Validation with Applications

Tapio Pahikkala, Parisa Movahedi, Ileana Montoya, Havu Miikonen, Stephan Foldes, Antti Airola, Laszlo Major

TL;DR

This work formalizes a precise bridge between cross-validation performance and error-detecting codes. It shows that the maximal number of labelings allowing zero LPOCV errors equals the maximal size of a constant-weight code of length $n$, weight $w$, and distance $4$, and it extends the theory to $W$-light codes for bounded numbers of errors. By recasting LPOCV outcomes as orientations of Johnson graphs, the authors derive upper and lower bounds on the maximal LPOCV capacity and introduce extended Bose-Rao constructions to achieve large code sizes, providing practical tools for LPOCV-based significance tests for AUC that hold for any learning algorithm. The paper also presents simulations and a real MRI dataset to illustrate how empirical critical values can be estimated and how test power depends on data separability and learner choice. These results offer a theoretically grounded framework for hypothesis testing in learning that is robust to the unknowns of the learning algorithm and data distribution, with potential practical impact on model evaluation and validation. The work points to future directions including leveraging algorithmic classes for tighter tests, extending to larger hold-out schemes, and exploring more powerful, code-based tests beyond the current LPOCV setup.

Abstract

How many different binary classification problems can a single learning algorithm solve on a fixed data set with exactly zero, or at most a given number of, cross-validation errors? While the number in the former case is known to be limited by the no-free-lunch theorem, we show that the exact answers are given by the theory of error-detecting codes. As a case study, we focus on the AUC performance measure and leave-pair-out cross-validation (LPOCV), in which every possible pair of data points with different class labels is held out at a time. We show that the maximal number of classification problems with fixed class proportion, for which a learning algorithm can achieve zero LPOCV error, equals the maximal number of code words in a constant weight code (CWC) with certain technical properties. We then generalize CWCs by introducing light CWCs, and prove an analogous result for nonzero LPOCV errors and light CWCs. Moreover, we prove both upper and lower bounds on the maximal numbers of code words in light CWCs. Finally, as an immediate practical application, we develop new LPOCV-based randomization tests for learning algorithms that generalize the classical Wilcoxon-Mann-Whitney U test.
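
The leave-pair-out procedure described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names `lpocv_errors` and `mean_direction_learner` are hypothetical, the toy learner is an assumption chosen only for brevity, and ties are counted as errors by assumption. Each class needs at least two members so that the training set never loses a whole class.

```python
from itertools import product

def lpocv_errors(X, y, fit):
    """Count leave-pair-out errors.

    For every pair (i, j) with y[i] == 1 and y[j] == 0, train on the
    remaining points and count an error when the held-out positive is
    not scored strictly above the held-out negative (ties are errors,
    which is an assumption of this sketch).
    """
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    errors = 0
    for i, j in product(pos, neg):
        train = [k for k in range(len(y)) if k not in (i, j)]
        score = fit([X[k] for k in train], [y[k] for k in train])
        if score(X[i]) <= score(X[j]):
            errors += 1
    return errors, len(pos) * len(neg)

def mean_direction_learner(X_train, y_train):
    """Hypothetical toy learner on 1-D inputs: it only learns whether
    positives tend to have larger or smaller values than negatives."""
    pos = [x for x, label in zip(X_train, y_train) if label == 1]
    neg = [x for x, label in zip(X_train, y_train) if label == 0]
    direction = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return lambda x: direction * x

# Usage: a perfectly separable labeling gives zero LPOCV errors.
errors, pairs = lpocv_errors([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1],
                             mean_direction_learner)
lpocv_auc = 1 - errors / pairs
```

The LPOCV estimate of AUC is one minus the error fraction, so the number of pairwise errors plays the role of the U statistic when the prediction function is fixed.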

Paper Structure

This paper contains 12 sections, 9 theorems, 27 equations, 8 figures.

Key Result

Theorem 1

Let $U_\mathcal{A}(\mathbf{y})$, as in (uexch), be a test statistic whose small values provide evidence against the null hypothesis. For each $\mathbf{y}$, let $p(\mathbf{y})$ be [...]. Then $p(Y)$ is a valid $p$-value.
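
The formula defining the $p$-value is not present in the text above. As an illustration only, the sketch below assumes the standard randomization-test form, in which the $p$-value of a labeling is the fraction of equally weighted labelings whose statistic is at most the observed one; whether this matches the paper's definition is not confirmed by the source. The names `randomization_p_value` and `pairwise_errors` are hypothetical, and the enumeration is exhaustive, so it is only feasible for small $n$.

```python
from fractions import Fraction
from itertools import combinations

def randomization_p_value(statistic, y):
    """Fraction of labelings with the same number of ones as y whose
    statistic is <= the observed statistic (small values = evidence
    against the null). Assumed form, exhaustive over all labelings."""
    n, w = len(y), sum(y)
    observed = statistic(list(y))
    at_most = total = 0
    for ones in combinations(range(n), w):
        labels = [1 if i in ones else 0 for i in range(n)]
        total += 1
        if statistic(labels) <= observed:
            at_most += 1
    return Fraction(at_most, total)

def pairwise_errors(scores):
    """Statistic for a fixed prediction function: the number of
    (positive, negative) pairs that the scores rank incorrectly."""
    def statistic(labels):
        pos = [s for s, label in zip(scores, labels) if label == 1]
        neg = [s for s, label in zip(scores, labels) if label == 0]
        return sum(1 for p in pos for q in neg if p <= q)
    return statistic

# Usage: with distinct scores, only one of the 20 labelings of
# length 6 and weight 3 has zero pairwise errors.
p = randomization_p_value(pairwise_errors([1, 2, 3, 4, 5, 6]),
                          [0, 0, 0, 1, 1, 1])
```

With a fixed prediction function this reduces to a one-sided exact test on the pairwise-error count; replacing `statistic` with an LPOCV error count would give the learning-algorithm variant, at a much higher computational cost.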

Figures (8)

  • Figure 1: The distribution of U-values under the null hypothesis for a sample of size 30, of which 15 are labeled with one, and a fixed prediction function. This is also known as the Wilcoxon distribution, and it determines the critical values for the WMW test. The red dashed line denotes the significance level 0.05, that is, 5% of the probability mass is on the left side of the line.
  • Figure 2: Critical values of the one-sided Wilcoxon-Mann-Whitney U test for a significance level 0.05, indicating for different numbers of data labeled with one and zero, what is the maximum number of pairwise errors with which the null hypothesis can still be rejected.
  • Figure 3: The distribution of U-values under the null hypothesis for a sample of size $n=30$ and weight $w=15$ with the order direction learner. The red dashed line denotes the significance level 0.05, that is, 5% of the probability mass is on the left side of the line.
  • Figure 4: Left: Illustration of an undirected Johnson graph $J(4,2)$, whose vertices are labeled with the constant weight words of length $4$ and weight $2$, and in which two vertices are connected with an edge if the Hamming distance between the vertex labels is smaller than 4. Middle: The labels of a subset of the Johnson graph vertices (circled) correspond to a constant weight code, since the vertices are disconnected from each other. Right: The labels of a subset of the Johnson graph vertices (circled) correspond to a $1$-light constant weight code, since there exists an orientation (illustrated with arcs) under which the outdegrees of the vertices are at most $1$.
  • Figure 5: An illustration of a Johnson graph $J(5,2)$ and its sub-graph consisting only of vertices with maximum degree 4.
  • ...and 3 more figures
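
The Figure 4 caption can be made concrete with a small brute-force check. The sketch below is an illustration under assumptions, not the paper's method: it builds the constant weight words, connects two words when their Hamming distance is 2 (for distinct words of equal weight this is the same as "smaller than 4"), and tests $W$-lightness by trying every orientation of the induced edges. All function names are hypothetical, and the search is exponential in the number of edges, so it only suits tiny examples such as $J(4,2)$.

```python
from itertools import combinations, product

def constant_weight_words(n, w):
    """All binary words of length n with exactly w ones."""
    return [tuple(1 if i in ones else 0 for i in range(n))
            for ones in combinations(range(n), w)]

def hamming(a, b):
    """Number of positions where the two words differ."""
    return sum(x != y for x, y in zip(a, b))

def johnson_edges(words):
    """Edges of the Johnson graph induced on the given words."""
    return [(a, b) for a, b in combinations(words, 2) if hamming(a, b) == 2]

def is_w_light(code, W):
    """True if some orientation of the induced edges gives every
    vertex outdegree at most W (brute force over 2^m orientations)."""
    edges = johnson_edges(code)
    for choice in product((0, 1), repeat=len(edges)):
        outdegree = {word: 0 for word in code}
        for (a, b), pick in zip(edges, choice):
            outdegree[a if pick == 0 else b] += 1
        if all(d <= W for d in outdegree.values()):
            return True
    return False

# Usage: J(4,2) has 6 vertices and 12 edges.
words = constant_weight_words(4, 2)
edges = johnson_edges(words)
# Two words at distance 4 share no edge, so they form a 0-light set.
pair_is_code = is_w_light([(1, 1, 0, 0), (0, 0, 1, 1)], 0)
# Three mutually adjacent words are 1-light (orient the triangle as a cycle).
triangle = [(1, 1, 0, 0), (1, 0, 1, 0), (0, 1, 1, 0)]
triangle_is_1_light = is_w_light(triangle, 1)
```

Under this reading, a $0$-light set has no edges at all, which corresponds to the constant weight code in the middle panel, while larger $W$ allows progressively denser subsets of the Johnson graph.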

Theorems & Definitions (25)

  • Example 1
  • Example 2
  • Definition 1: leave-pair-out (LPO)
  • Definition 2: Null hypothesis
  • Definition 3: LPOCV
  • Theorem 1
  • Definition 4: Johnson graph
  • Definition 5: $W$-light codes
  • Definition 6: Maximal size $W$ light code
  • Proposition 1
  • ...and 15 more