Block-regularized 5$\times$2 Cross-validated McNemar's Test for Comparing Two Classification Algorithms

Jing Yang; Ruibo Wang; Yijun Song; Jihong Li

TL;DR

This work addresses the challenge of statistically comparing two classifiers with McNemar's test by integrating cross-validation in a principled way. It introduces a block-regularized $5\times 2$ CV (5×2 BCV) scheme and compresses the resulting ten correlated contingency tables into an effective contingency table $\mathcal{C}_e$, analyzed via a Bayesian lens with correlation coefficients $\rho_1$ and $\rho_2$. The resulting statistic $\mathcal{M}^{\text{BCV}}$ (with a conservative fix $\rho_1=\rho_2=0.5$, yielding $t=20/11$) provides a robust test for comparing error rates, achieving controlled Type I error and improved power across extensive synthetic and real-world datasets. The findings advocate using the $5\times 2$ BCV McNemar's test in practical algorithm comparisons and suggest avenues for extending the approach to $m\times 2$ BCV and further Bayesian refinements.
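The value $t=20/11$ can be reproduced by a short calculation under one assumption that is not spelled out in the summary above: that $t$ acts as the effective number of independent tables among the ten, with $\rho_1$ the correlation between the two tables of the same two-fold split and $\rho_2$ the correlation between tables from different splits. Each table then has one partner at correlation $\rho_1$ and eight at correlation $\rho_2$, so the usual variance-inflation argument for an average of ten correlated estimates gives the following. This is a hedged reading, not necessarily the paper's own derivation.

$$
t = \frac{10}{1 + \rho_1 + 8\rho_2},
\qquad
\rho_1 = \rho_2 = 0.5
\;\Rightarrow\;
t = \frac{10}{1 + 0.5 + 4} = \frac{10}{5.5} = \frac{20}{11} \approx 1.82 .
$$

Under this reading, the ten correlated tables carry roughly as much information as 1.8 independent hold-out tables when both correlations are fixed at 0.5.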

Abstract

When comparing two classification algorithms, the widely used McNemar's test infers whether the error rates of the two algorithms differ significantly. However, the power of the conventional McNemar's test is usually low because the hold-out (HO) method it relies on uses a single train-validation split, which typically yields a high-variance estimate of the error rates. In contrast, a cross-validation (CV) method repeats the HO method multiple times and produces a more stable estimate, so CV can substantially improve the power of McNemar's test. Among CV methods, the block-regularized 5$\times$2 CV (BCV) has been shown in many previous studies to be superior to other CV methods for comparing algorithms, because it produces a high-quality estimator of the error rate by regularizing the numbers of overlapping records between all training sets. In this study, we compress the 10 correlated contingency tables in the 5$\times$2 BCV into an effective contingency table and define a 5$\times$2 BCV McNemar's test on the basis of that table. We demonstrate the reasonable type I error and the promising power of the proposed 5$\times$2 BCV McNemar's test on multiple simulated and real-world data sets.
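For orientation, the conventional hold-out McNemar's test that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the textbook test with continuity correction, not the proposed 5$\times$2 BCV statistic. The function names `discordant_counts` and `mcnemar_test` are hypothetical, and the ordering of the two discordant counts is an assumption that may not match the paper's indexing of the contingency table.

```python
import math


def discordant_counts(y_true, pred_a, pred_b):
    """Count validation records on which exactly one classifier errs.

    Returns (a_only_wrong, b_only_wrong). Records where both classifiers
    are right, or both are wrong, do not enter McNemar's statistic.
    """
    a_only_wrong = 0
    b_only_wrong = 0
    for y, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == y), (b == y)
        if b_ok and not a_ok:
            a_only_wrong += 1
        elif a_ok and not b_ok:
            b_only_wrong += 1
    return a_only_wrong, b_only_wrong


def mcnemar_test(a_only_wrong, b_only_wrong, correction=True):
    """Textbook McNemar chi-square test (1 degree of freedom).

    Returns (statistic, p_value). With no discordant records there is
    no evidence of a difference, so (0.0, 1.0) is returned.
    """
    n_discordant = a_only_wrong + b_only_wrong
    if n_discordant == 0:
        return 0.0, 1.0
    diff = abs(a_only_wrong - b_only_wrong)
    if correction:
        # Continuity correction; clamp at zero so the statistic
        # does not increase when the raw difference is zero.
        diff = max(diff - 1, 0)
    statistic = diff ** 2 / n_discordant
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    stat, p = mcnemar_test(15, 5)
    print(f"statistic={stat:.3f}, p-value={p:.4f}")
```

Because this test sees only the discordant records of a single validation split, its outcome can change noticeably from one split to another, which is the instability the abstract attributes to the HO method.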

Paper Structure (15 sections, 7 theorems, 27 equations, 4 figures, 5 tables)

Introduction
Conventional McNemar's Test
McNemar's test on an HO validation
Bayesian interpretation of $\mathcal{C}$
Naïve $K$-Fold CV McNemar's Test
$5\times 2$ BCV McNemar's Test
$5\times 2$ BCV
Contingency tables on $5\times 2$ BCV
Properties of $\rho_1$ and $\rho_2$
Effective contingency table on a $5\times 2$ BCV
$5\times 2$ BCV McNemar's test
Experimental Results and Analysis
Experiments for RQ1
Experiments for RQ2
Conclusion

Key Result

Lemma 1

Given that $\mathcal{C}|\boldsymbol{\pi}\sim \mathbf{M}(n_2,\boldsymbol{\pi})$, we obtain where $\mathbf{B}(\cdot,\cdot)$ represents a binomial distribution.

Figures (4)

Figure 1: Scatter plot of $\rho_1$ and $\rho_2$.
Figure 2: Demonstration of an EXP6 data set with $n=300$.
Figure 3: Simulations of true error rates of algorithms on the synthetic and real-world data sets
Figure 4: Power curves of different tests on the synthetic and real-world data sets.

Theorems & Definitions (7)

Lemma 1
Lemma 2
Corollary 1
Theorem 1
Lemma 3
Theorem 2
Theorem 3
