Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget

Florian E. Dorner, Moritz Hardt

TL;DR

If the goal is to identify the better of two classifiers, it is best to spend the budget on collecting a single label for more samples rather than several labels per sample. The result follows from a non-trivial application of Cramér's theorem.

Abstract

We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.
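To make the trade-off behind majority voting concrete, the following minimal sketch computes how accurate an aggregated label becomes. It assumes that each of the $m$ labels is independently correct with probability $q$ and that $m$ is odd; the function name `majority_vote_accuracy` is hypothetical and not taken from the paper.

```python
from math import comb


def majority_vote_accuracy(q: float, m: int) -> float:
    """Probability that a majority vote over m noisy labels is correct.

    Assumption: the m labels are independent and each is correct with
    probability q. m must be odd so that the vote cannot tie.
    """
    if m < 1 or m % 2 == 0:
        raise ValueError("m must be a positive odd integer")
    # The vote is correct when more than half of the m labels are correct,
    # which is a binomial tail sum.
    return sum(
        comb(m, j) * q**j * (1 - q) ** (m - j)
        for j in range(m // 2 + 1, m + 1)
    )
```

Under these assumptions, `majority_vote_accuracy(0.75, 3)` evaluates to 0.84375: three labels of accuracy 0.75 yield one label of higher accuracy, but at three times the cost per data point. The abstract's claim concerns whether that gain is worth giving up two thirds of the data points.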

Paper Structure (19 sections, 16 theorems, 168 equations, 3 figures)


Key Result

Theorem 1.1

For a sufficiently large sample budget $k$, the probability of identifying the better of two binary classifiers is maximized at $m=1$ labels per data point.
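The statement can be explored empirically with a small Monte Carlo sketch. The setup below is an assumption made for illustration and may differ from the paper's exact model: the better classifier is correct with probability $p+\epsilon$, the worse one with probability $p$, every noisy label is correct with probability $q$, all events are independent, and ties between the two classifiers are counted as half a success. The budget $k$ buys $\lfloor k/m \rfloor$ data points with $m$ labels each. All function names are hypothetical.

```python
import random


def majority_label_correct(m, q, rng):
    """True if the majority of m independent noisy labels is correct (m odd)."""
    correct_votes = sum(rng.random() < q for _ in range(m))
    return 2 * correct_votes > m


def prob_pick_better(k, m, p, eps, q, trials, seed=0):
    """Estimate the probability of identifying the better classifier.

    k: total label budget, m: labels per data point (assumed odd),
    p: accuracy of the worse classifier, eps: accuracy margin,
    q: accuracy of a single noisy label, trials: Monte Carlo repetitions.
    """
    rng = random.Random(seed)
    n = k // m  # number of data points the budget pays for
    wins = 0.0
    for _ in range(trials):
        diff = 0  # agreements of the better classifier minus the worse one
        for _ in range(n):
            label_ok = majority_label_correct(m, q, rng)
            better_ok = rng.random() < p + eps
            worse_ok = rng.random() < p
            # With binary labels, a classifier agrees with the aggregated
            # label exactly when both are correct or both are wrong.
            diff += (better_ok == label_ok) - (worse_ok == label_ok)
        if diff > 0:
            wins += 1.0
        elif diff == 0:
            wins += 0.5  # tie: count as a fair coin flip
    return wins / trials
```

One way to use it is to fix the budget and compare several values of $m$, for example `prob_pick_better(k=1500, m=m, p=0.8, eps=0.01, q=0.8, trials=2000)` for `m` in `(1, 3, 5)`. With a margin this small the estimates are noisy, so many trials are needed before any difference between the values of $m$ becomes visible.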

Figures (3)

  • Figure 1: Number of testable classifiers according to the Hoeffding (a) and Cramér-based (b) upper bounds on the error probability and a union bound (see Section \ref{sec:bench}) for accuracies $p=q=0.75$, margin $\epsilon=0.1$ and error tolerance $\delta=0.05$. Note the different $y$ axes.
  • Figure 2: Probability of identifying $c_b$ for accuracy $p=0.8$, margin $\epsilon=0.01$, budget $k=1500$ (a), label accuracy $q=0.8$ (b).
  • Figure 3: (a) Convergence of normalized log error rates to the values implied by Cramér's theorem for label accuracy $q=0.75$, classifier accuracy $p=0.7$, margin $\epsilon=0.1$ and $m\in\{1,3\}$. (b) Upper bounds on normalized log error rate for Cramér's bound compared to Hoeffding's bound.

Theorems & Definitions (25)

  • Theorem 1.1: Informal
  • Proposition 3.1
  • Lemma 3.1
  • Proposition 3.2
  • Lemma 4.1
  • Theorem 4.1
  • Proposition 2.1
  • proof
  • Proposition 2.2
  • proof
  • ...and 15 more