Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget

Florian E. Dorner; Moritz Hardt

TL;DR

To identify the better of two classifiers on a fixed labeling budget, it is best to spend the budget on collecting a single label for more samples. The result follows from a non-trivial application of Cramér's theorem.

Abstract

We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where our results overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.

Paper Structure (19 sections, 16 theorems, 168 equations, 3 figures)

Introduction
Related Work
Label aggregation in dataset creation.
The impact of label aggregation on learning.
Annotator disagreement as a feature.
The theory of benchmarking.
Formal Setup
Parameterizing the Gap Indicator
Hoeffding Bounds
Correlated Classifiers
Application to Benchmarking
Proof of the Main Theorem
Conclusion
Numerical Evidence
Parameterizations of the Gap Indicator
...and 4 more sections

Key Result

Theorem 1.1

For a sufficiently large sample budget $k$, the probability of identifying the better of two binary classifiers is maximized at $m=1$ labels per data point.
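The trade-off behind this statement can be explored numerically. The sketch below is an illustration only, not the paper's proof or experimental code, and it rests on several assumptions that do not come from the text above: the better classifier has accuracy $p+\epsilon$ and the other $p$, the two classifiers err independently of each other and of the labels, each noisy label is correct independently with probability $q$, $m$ is odd, the $m$ labels per point are aggregated by majority vote, the budget $k$ buys $k/m$ labeled points (rounded down), and ties in measured accuracy are broken by a fair coin. The function names are hypothetical. Under these assumptions the probability of picking the better classifier can be computed exactly by convolving the per-point score difference.

```python
from math import comb

import numpy as np


def majority_accuracy(q: float, m: int) -> float:
    """Probability that a majority vote over m (odd) independent labels,
    each correct with probability q, yields the correct label."""
    return sum(comb(m, j) * q**j * (1 - q) ** (m - j)
               for j in range(m // 2 + 1, m + 1))


def prob_identify_better(p: float, eps: float, q: float, k: int, m: int) -> float:
    """Exact probability that the classifier with true accuracy p + eps scores
    higher than the one with accuracy p on n = k // m points, each labeled by
    a majority vote over m noisy labels (assumed model, see lead-in)."""
    n = k // m
    q_m = majority_accuracy(q, m)
    p_a, p_b = p + eps, p
    # Per-point score difference D in {-1, 0, +1}: D = +1 when only the better
    # classifier agrees with the aggregated label, D = -1 when only the worse does.
    plus = p_a * (1 - p_b) * q_m + (1 - p_a) * p_b * (1 - q_m)
    minus = (1 - p_a) * p_b * q_m + p_a * (1 - p_b) * (1 - q_m)
    step = np.array([minus, 1.0 - plus - minus, plus])
    # Distribution of the sum of n i.i.d. copies of D; dist[i] = P(sum = i - n).
    dist = np.array([1.0])
    for _ in range(n):
        dist = np.convolve(dist, step)
    # Win outright, or win the coin flip on a tie.
    return float(dist[n + 1:].sum() + 0.5 * dist[n])


if __name__ == "__main__":
    for m in (1, 3, 5):
        prob = prob_identify_better(p=0.8, eps=0.01, q=0.8, k=1500, m=m)
        print(f"m={m}: P(identify better classifier) = {prob:.4f}")
```

In this model the expected per-point score gap is $\epsilon\,(2q_m - 1)$, where $q_m$ is the majority-vote accuracy, so aggregation sharpens each comparison but divides the number of comparisons by $m$. For the parameter values used here, the loss in sample size outweighs the gain in label quality.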

Figures (3)

Figure 1: Number of testable classifiers according to the Hoeffding (a) and Cramér-based (b) upper bounds on the error probability and a union bound (see Section "Application to Benchmarking") for accuracies $p=q=0.75$, margin $\epsilon=0.1$ and error tolerance $\delta=0.05$. Note the different $y$ axes.
Figure 2: Probability of identifying $c_b$ for accuracy $p=0.8$, margin $\epsilon=0.01$, budget $k=1500$ (a), label accuracy $q=0.8$ (b).
Figure 3: (a) Convergence of normalized log error rates to the values implied by Cramér's theorem for label accuracy $q=0.75$, classifier accuracy $p=0.7$, margin $\epsilon=0.1$ and $m\in\{1,3\}$. (b) Upper bounds on the normalized log error rate for Cramér's bound compared to Hoeffding's bound.

Theorems & Definitions (25)

Theorem 1.1: Informal
Proposition 3.1
Lemma 3.1
Proposition 3.2
Lemma 4.1
Theorem 4.1
Proposition 2.1
proof
Proposition 2.2
proof
...and 15 more
