Block-regularized 5$\times$2 Cross-validated McNemar's Test for Comparing Two Classification Algorithms

Jing Yang, Ruibo Wang, Yijun Song, Jihong Li

TL;DR

This work addresses the challenge of statistically comparing two classifiers with McNemar's test by integrating cross-validation in a principled way. It introduces a block-regularized $5\times 2$ CV ($5\times 2$ BCV) scheme and compresses the resulting ten correlated contingency tables into an effective contingency table $\mathcal{C}_e$, which is analyzed from a Bayesian perspective using two correlation coefficients, $\rho_1$ and $\rho_2$. The resulting statistic $\mathcal{M}^{\text{BCV}}$ (with the conservative choice $\rho_1=\rho_2=0.5$, yielding $t=20/11$) provides a robust test for comparing error rates, achieving controlled Type I error and improved power across extensive synthetic and real-world datasets. The findings advocate using the $5\times 2$ BCV McNemar's test in practical algorithm comparisons and suggest extending the approach to $m\times 2$ BCV and to further Bayesian refinements.
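The value $t=20/11$ can be reproduced with a short calculation. The sketch below rests on assumptions that the summary does not state: that $\rho_1$ is the correlation between the two tables within one 2-fold replication, that $\rho_2$ is the correlation between tables from different replications, and that $t$ plays the role of an effective number of independent tables. The function name `effective_table_count` is hypothetical.

```python
from fractions import Fraction


def effective_table_count(rho1, rho2, groups=5, folds=2):
    """Effective number of independent estimates among groups*folds correlated ones.

    Assumed correlation structure (not taken from the paper's text):
      - each estimate has correlation rho1 with the other (folds - 1)
        estimates in its own replication,
      - and correlation rho2 with the (k - folds) estimates from other
        replications, where k = groups * folds.
    With equal variances sigma^2, the variance of the mean of the k estimates
    is sigma^2 / k * (1 + (folds - 1) * rho1 + (k - folds) * rho2),
    so k divided by that inflation factor acts as an effective sample count.
    """
    k = groups * folds
    inflation = 1 + (folds - 1) * rho1 + (k - folds) * rho2
    return Fraction(k) / inflation


# Conservative choice rho1 = rho2 = 1/2 for the 5x2 design.
t = effective_table_count(Fraction(1, 2), Fraction(1, 2))
print(t)  # 20/11
```

With zero correlations the function returns 10, the plain count of tables; any positive correlation shrinks it, which is why the choice of 0.5 is the cautious one under this reading.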

Abstract

In the task of comparing two classification algorithms, the widely used McNemar's test aims to infer whether there is a significant difference between the error rates of the two algorithms. However, the power of the conventional McNemar's test is usually low because the hold-out (HO) method in the test uses only a single train-validation split, which usually produces a highly variable estimate of the error rates. In contrast, a cross-validation (CV) method repeats the HO method multiple times and produces a more stable estimate. A CV method is therefore well placed to improve the power of McNemar's test. Among all types of CV methods, block-regularized 5$\times$2 CV (BCV) has been shown in many previous studies to be superior to other CV methods for comparing algorithms, because 5$\times$2 BCV produces a high-quality estimator of the error rate by regularizing the numbers of overlapping records between all training sets. In this study, we compress the 10 correlated contingency tables in the 5$\times$2 BCV to form an effective contingency table. Then, we define a 5$\times$2 BCV McNemar's test on the basis of the effective contingency table. We demonstrate the reasonable type I error and the promising power of the proposed 5$\times$2 BCV McNemar's test on multiple simulated and real-world data sets.
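For reference, the conventional single-split McNemar's test that the abstract takes as its baseline can be sketched as follows. This is the textbook chi-square form on one hold-out contingency table, not the proposed 5$\times$2 BCV statistic. The function name `mcnemar_test` and the argument names `n01` and `n10` (records misclassified by only the first or only the second algorithm) are illustrative choices.

```python
import math


def mcnemar_test(n01, n10, correction=True):
    """Textbook McNemar's test on the discordant counts of one contingency table.

    n01: validation records misclassified by algorithm A only.
    n10: validation records misclassified by algorithm B only.
    Returns (statistic, p_value), using a chi-square reference
    distribution with 1 degree of freedom.
    """
    discordant = n01 + n10
    if discordant == 0:
        # No disagreement between the algorithms: no evidence of a difference.
        return 0.0, 1.0
    diff = abs(n01 - n10)
    if correction:
        # Continuity correction, floored at zero.
        diff = max(diff - 1, 0)
    statistic = diff ** 2 / discordant
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


stat, p = mcnemar_test(15, 5, correction=False)
print(stat, p)
```

Because only the discordant counts from a single split enter the statistic, a different random split can give a noticeably different result, which is the instability the abstract attributes to the HO method.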

Paper Structure (15 sections, 7 theorems, 27 equations, 4 figures, 5 tables)

This paper contains 15 sections, 7 theorems, 27 equations, 4 figures, 5 tables.

Key Result

Lemma 1

Given that $\mathcal{C}|\boldsymbol{\pi}\sim \mathbf{M}(n_2,\boldsymbol{\pi})$, we obtain where $\mathbf{B}(\cdot,\cdot)$ represents a binomial distribution.
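For orientation, two standard binomial consequences of a multinomial model for a contingency table are sketched below. The cell labels $n_{01}$, $n_{10}$ (discordant counts) and $\pi_{01}$, $\pi_{10}$ (their probabilities) are assumed notation, and the lemma's own statement may take a different form.

```latex
% Assumed notation: n_{01}, n_{10} are the discordant cells of \mathcal{C},
% with cell probabilities \pi_{01}, \pi_{10} in \boldsymbol{\pi}.
\begin{align*}
  % Marginal distribution of a single cell of a multinomial table:
  n_{01} \mid \boldsymbol{\pi} &\sim \mathbf{B}(n_2,\ \pi_{01}), \\
  % Distribution of one discordant cell given the discordant total:
  n_{01} \mid (n_{01}+n_{10}=m),\ \boldsymbol{\pi}
    &\sim \mathbf{B}\!\left(m,\ \frac{\pi_{01}}{\pi_{01}+\pi_{10}}\right).
\end{align*}
```

The second line is the property behind the classical McNemar's test: under equal error rates, $\pi_{01}=\pi_{10}$ and the success probability reduces to $1/2$.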

Figures (4)

  • Figure 1: Scatter plot of $\rho_1$ and $\rho_2$.
  • Figure 2: Demonstration of an EXP6 data set with $n=300$.
  • Figure 3: Simulations of true error rates of algorithms on the synthetic and real-world data sets.
  • Figure 4: Power curves of different tests on the synthetic and real-world data sets.

Theorems & Definitions (7)

  • Lemma 1
  • Lemma 2
  • Corollary 1
  • Theorem 1
  • Lemma 3
  • Theorem 2
  • Theorem 3