How many classifiers do we need?

Hyunsuk Kim, Liam Hodgkinson, Ryan Theisen, Michael W. Mahoney

TL;DR

This paper provides a detailed analysis of how the disagreement and the polarization among classifiers relate to the performance gain achieved by aggregating individual classifiers, for majority vote strategies in classification tasks.

Abstract

As performance gains through scaling data and/or model size experience diminishing returns, it is becoming increasingly popular to turn to ensembling, where the predictions of multiple models are combined to improve accuracy. In this paper, we provide a detailed analysis of how the disagreement and the polarization (a notion we introduce and define in this paper) among classifiers relate to the performance gain achieved by aggregating individual classifiers, for majority vote strategies in classification tasks. We address these questions in the following ways. (1) An upper bound for polarization is derived, and we propose what we call a neural polarization law: most interpolating neural network models are 4/3-polarized. Our empirical results not only support this conjecture but also show that polarization is nearly constant for a dataset, regardless of hyperparameters or architectures of classifiers. (2) The error of the majority vote classifier is considered under restricted entropy conditions, and we present a tight upper bound that indicates that the disagreement is linearly correlated with the target, and that the slope is linear in the polarization. (3) We prove results for the asymptotic behavior of the disagreement in terms of the number of classifiers, which we show can help in predicting the performance for a larger number of classifiers from that of a smaller number. Our theories and claims are supported by empirical results on several image classification tasks with various types of neural networks.
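The quantities the abstract relates to one another can be illustrated with a small sketch. It computes the average error of the individual classifiers, the error of their majority vote, and their average pairwise disagreement on a toy set of predictions. The function names and the toy data are hypothetical, and the disagreement is taken here to be the mean fraction of examples on which two distinct classifiers differ, which is a common convention and an assumption rather than the paper's stated definition. Polarization is left out because its definition is not given in this summary.

```python
from collections import Counter
from itertools import combinations


def majority_vote(votes):
    """Return the most common label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_summary(predictions, labels):
    """Summarize an ensemble, where predictions[k][i] is classifier k's
    label for example i (hypothetical helper, not from the paper)."""
    n = len(labels)
    # Average error rate of the individual classifiers.
    avg_error = sum(
        sum(p != y for p, y in zip(preds, labels)) / n for preds in predictions
    ) / len(predictions)
    # Error rate of the majority (plurality) vote over all classifiers.
    voted = [majority_vote(column) for column in zip(*predictions)]
    mv_error = sum(v != y for v, y in zip(voted, labels)) / n
    # Assumed definition: mean fraction of examples on which a pair of
    # distinct classifiers disagree, averaged over all pairs.
    pairs = list(combinations(predictions, 2))
    disagreement = sum(
        sum(a != b for a, b in zip(p, q)) / n for p, q in pairs
    ) / len(pairs)
    return avg_error, mv_error, disagreement


# Toy data: three classifiers, four examples, each classifier wrong once.
labels = [0, 1, 1, 0]
predictions = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
avg_error, mv_error, disagreement = ensemble_summary(predictions, labels)
print(avg_error, mv_error, disagreement)
```

On this toy data each classifier errs on a different example, so the vote corrects every individual mistake: the script prints `0.25 0.0 0.5`. The gain from aggregation here comes entirely from the classifiers disagreeing on where they err, which is the relationship the paper analyzes in general.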

Paper Structure

This paper contains 25 sections, 17 theorems, 57 equations, 4 figures.

Key Result

Proposition 1

Competent ensembles are $2$-polarized.

Figures (4)

  • Figure 1: Polarization $\eta_\rho$ obtained from ResNet18 models trained on CIFAR-10 with various sets of hyperparameters, tested on (a) out-of-sample CIFAR-10 and (b) an out-of-distribution dataset, CIFAR-10.1. The red dashed line indicates $y=4/3$, the suggested value of polarization that appears in Theorem \ref{thm:polarization4/3} and Conjecture \ref{conj:polarization4/3}.
  • Figure 2: Polarization $\eta_\rho$ obtained (a) from various architectures trained on CIFAR-10 and (b) only from interpolating classifiers trained on various datasets. The red dashed line indicates $y=4/3$. In subplot (b), the polarization of every interpolating model except one is smaller than $4/3$, which aligns with Conjecture \ref{conj:polarization4/3}.
  • Figure 3: Comparing our new bound from Corollary \ref{cor:finite-hmv} (colored black), which is the right-hand side of inequality \ref{ieq:finite-hmv}, with bounds from previous studies. Green corresponds to the C-bound in inequality \ref{ieq:cbound}, and blue corresponds to the right-hand side of inequality \ref{ieq:ryan-second-order-bound}. ResNet18, ResNet50, and ResNet101 models with various sets of hyperparameters are trained on CIFAR-10 and then tested on (a) out-of-sample CIFAR-10 and (b) an out-of-distribution dataset, CIFAR-10.1.
  • Figure 4: Comparing the estimated (extrapolated) majority vote error rates in equations \ref{eq:fbound} (blue dashed lines) and \ref{eq:fbound3} (orange dashed lines) with the true majority vote error (green solid line) for each number of classifiers. The solid sky-blue line corresponds to the average error rate of the constituent classifiers. Subplots (a1), (b), (c), (d), and (e) show the results from different pairs of (classification model, dataset). Subplot (a2) overlays the right-hand side of inequality \ref{ieq:cbound} (C-bound, colored red) and of inequality \ref{ieq:ryan-second-order-bound} (theisen2023 bound, colored purple) on subplot (a1). These two quantities from previous studies are much larger than the average error rate. We see the same pattern for the other (architecture, dataset) pairs, which we therefore omit from the plot. For more details on these empirical results, see Appendix \ref{app:exp}.

Theorems & Definitions (35)

  • Definition 1: Polarization
  • Proposition 1
  • Theorem 1
  • Definition 2: Interpolating (belkin2019reconciling)
  • Conjecture 1: Neural Polarization Law
  • Theorem 2
  • Theorem 3
  • Corollary 1: Finite Ensemble
  • Theorem 4
  • Corollary 2
  • ...and 25 more