Promises and Pitfalls of Threshold-based Auto-labeling

Harit Vishwakarma, Heguang Lin, Frederic Sala, Ramya Korlakai Vinayak

TL;DR

This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data.

Abstract

Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely-used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
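The thresholding step described in the abstract can be illustrated with a minimal single-round sketch. It is not the paper's algorithm: it uses the plain empirical validation error rather than any confidence-bound correction, it omits the iterative training and querying rounds, and the function names `estimate_threshold` and `auto_label` and the parameter `eps` (target auto-labeling error) are hypothetical names chosen for illustration.

```python
import numpy as np

def estimate_threshold(val_conf, val_correct, eps, min_points=1):
    """Return the smallest confidence threshold t such that the empirical
    error on validation points with confidence >= t is at most eps.
    Returns None if no candidate threshold qualifies.
    (Simplified assumption: raw empirical error, no confidence bound.)"""
    val_conf = np.asarray(val_conf, dtype=float)
    val_correct = np.asarray(val_correct, dtype=bool)
    # Candidate thresholds are the observed confidences, in ascending order,
    # so the first qualifying threshold gives the largest coverage.
    for t in np.unique(val_conf):
        mask = val_conf >= t
        if mask.sum() < min_points:
            continue
        err = 1.0 - val_correct[mask].mean()
        if err <= eps:
            return float(t)
    return None

def auto_label(pool_conf, pool_pred, threshold):
    """Machine-label the pool points whose confidence is >= threshold.
    Returns (indices of auto-labeled points, their predicted labels)."""
    pool_conf = np.asarray(pool_conf, dtype=float)
    pool_pred = np.asarray(pool_pred)
    if threshold is None:
        return np.array([], dtype=int), pool_pred[:0]
    idx = np.flatnonzero(pool_conf >= threshold)
    return idx, pool_pred[idx]

# Toy usage: human-labeled validation points with model confidences.
val_conf = [0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
val_correct = [0, 1, 0, 1, 1, 1]  # 1 = model prediction matches human label
t = estimate_threshold(val_conf, val_correct, eps=0.1)
idx, labels = auto_label([0.5, 0.85, 0.8, 0.99], [1, 0, 1, 1], t)
```

With only a handful of validation points above the threshold, the empirical error is a noisy estimate of the true auto-labeling error, which is the reason the amount of validation data matters in the analysis summarized above.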

Paper Structure

This paper contains 30 sections, 13 theorems, 61 equations, 11 figures, 17 tables, 2 algorithms.

Key Result

Theorem 3.2

(Overall Auto-Labeling Error and Coverage) Let $k$ denote the number of rounds of the TBAL algorithm. Let $n_v^{(i)}, n_a^{(i)}$ denote the number of validation and auto-labeled points at epoch $i$ and $n^{(i)} = |X^{(i)}|$. Let $X_{pool}(A_k)$ be the set of auto-labeled points at the

Figures (11)

  • Figure 1: High-level workflow of threshold-based auto-labeling (TBAL). Box (B) shows the key component: estimating the auto-labeling region using validation data and auto-labeling the points in it.
  • Figure 2: Comparison of TBAL, active learning followed by selective classification (AL+SC), and passive learning followed by selective classification (PL+SC) on the Circles dataset (Sec. \ref{sec:autolabelingComparisons}) using linear classifiers and confidence functions. (a) Samples auto-labeled, queried, and left unlabeled. (b) The auto-labeling error and coverage achieved by the algorithms. (50 trials.)
  • Figure 3: Left: Simplified upper bound from Corollary 3.4 (ignoring constants) on excess auto-labeling error for the Unit-Ball setting, i.e., homogeneous linear classifiers with $d=30$. Right: The worst observed auto-labeling error over 25 trials in the Unit-Ball experiment.
  • Figure 4: Results for varying $N_q$, the maximum number of samples the algorithm can use for training while providing sufficient validation samples.
  • Figure 5: Comparison of threshold-based auto-labeling (TBAL) and active learning followed by selective classification (AL+SC) on the XOR dataset. Left (a): samples that were auto-labeled, queried, and left unlabeled by these methods. Right (b): the auto-labeling error and coverage achieved. The lines show the mean and the shaded regions show one standard deviation, estimated over 10 trials with different random seeds.
  • ...and 6 more figures

Theorems & Definitions (24)

  • Definition 3.1
  • Theorem 3.2
  • Lemma 3.3
  • Corollary 3.4
  • Definition A.1
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • ...and 14 more