Promises and Pitfalls of Threshold-based Auto-labeling

Harit Vishwakarma; Heguang Lin; Frederic Sala; Ramya Korlakai Vinayak

TL;DR

This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data.

Abstract

Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
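The thresholding step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a single round, a plain empirical error estimate on the validation set (with no confidence-interval correction), and hypothetical names (`estimate_threshold`, `auto_label`, `max_error`) chosen only for this example.

```python
from typing import List, Optional, Sequence


def estimate_threshold(scores: Sequence[float],
                       correct: Sequence[bool],
                       max_error: float) -> Optional[float]:
    """Pick the smallest confidence threshold that meets the error target.

    scores:    model confidence on each human-labeled validation point
    correct:   whether the model's prediction matched the human label
    max_error: tolerated error rate among points at or above the threshold

    Returns None when no candidate threshold satisfies the target.
    (Illustrative assumption: the error is the raw empirical rate.)
    """
    for t in sorted(set(scores)):
        # Validation points that would be machine-labeled at threshold t.
        kept = [c for s, c in zip(scores, correct) if s >= t]
        errors = sum(1 for c in kept if not c)
        if errors / len(kept) <= max_error:
            return t  # smallest qualifying threshold gives the most coverage
    return None


def auto_label(pool_scores: Sequence[float],
               pool_predictions: Sequence[int],
               threshold: Optional[float]) -> List[Optional[int]]:
    """Machine-label pool points at or above the threshold; leave the rest as None."""
    if threshold is None:
        return [None] * len(pool_scores)
    return [pred if s >= threshold else None
            for s, pred in zip(pool_scores, pool_predictions)]


# Toy validation set: confidences and whether each prediction was right.
val_scores = [0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
val_correct = [False, True, False, True, True, True]

threshold = estimate_threshold(val_scores, val_correct, max_error=0.1)
labels = auto_label([0.50, 0.85, 0.99], [1, 0, 1], threshold)
```

In this toy run, only points whose confidence reaches the estimated threshold receive a machine label; the remaining points stay unlabeled and would go to human annotators. With few validation points above the threshold, the empirical error estimate is noisy, which is the reason the amount of validation data matters for the guarantee.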

Paper Structure (30 sections, 13 theorems, 61 equations, 11 figures, 17 tables, 2 algorithms)

Introduction
Threshold-Based Auto-Labeling Algorithm
Problem Setup
Notation.
Description of the algorithm
Comparison between Auto-Labeling, Active Learning and Selective Classification
Theoretical Analysis
Linear Classifier Setting
Experiments
Role of Validation Data
Role of Training Data Size
Related Works
Conclusion and Future Work
Acknowledgments
Definitions and Notation
...and 15 more sections

Key Result

Theorem 3.2

(Overall Auto-Labeling Error and Coverage) Let $k$ denote the number of rounds of the TBAL algorithm. Let $n_v^{(i)}$ and $n_a^{(i)}$ denote the number of validation and auto-labeled points at epoch $i$, and let $n^{(i)} = |X^{(i)}|$. Let $X_{pool}(A_k)$ be the set of auto-labeled points at the

Figures (11)

Figure 1: High-level workflow of threshold-based auto-labeling (TBAL). Box (B) shows the key component: estimating the auto-labeling region using validation data and auto-labeling the points in it.
Figure 2: Comparison of TBAL, active learning followed by selective classification (AL+SC), and passive learning followed by selective classification (PL+SC) on the Circles dataset (Sec. \ref{sec:autolabelingComparisons}) using linear classifiers and confidence functions. (a) Samples auto-labeled, queried, and left unlabeled. (b) The auto-labeling error and coverage achieved by the algorithms. (50 trials.)
Figure 3: Left: Simplified upper bound from Corollary 3.4 (ignoring constants) on excess auto-labeling error for the Unit-Ball setting i.e. homogeneous linear classifier with $d=30$. Right: The worst observed auto-labeling error over 25 trials in the Unit-Ball experiment.
Figure 4: Results for varying $N_q$, the maximum number of samples the algorithm can use for training while providing sufficient validation samples.
Figure 5: Comparison of Threshold-Based Auto-Labeling (TBAL) and Active Learning followed by Selective Classification (AL+SC) on the XOR dataset. Left (a): samples that were auto-labeled, queried, and left unlabeled by these methods. Right (b): the auto-labeling error and coverage achieved. The lines show the mean and the shaded regions show one standard deviation, estimated over 10 trials with different random seeds.
...and 6 more figures

Theorems & Definitions (24)

Definition 3.1
Theorem 3.2
Lemma 3.3
Corollary 3.4
Definition A.1
proof
Lemma B.1
proof
Lemma B.2
proof
...and 14 more
