Pearls from Pebbles: Improved Confidence Functions for Auto-labeling

Harit Vishwakarma, Reid, Chen, Sui Jiet Tay, Satya Sai Srinath Namburi, Frederic Sala, Ramya Korlakai Vinayak

TL;DR

Threshold-based auto-labeling (TBAL) reduces labeling cost by auto-labeling the points on which a model is confident, but it relies on confidence scores that are often miscalibrated. Colander is a principled, post-hoc learning framework that optimizes confidence functions and per-class thresholds for TBAL, using empirical surrogates and a validation split. Across multiple datasets, Colander improves auto-labeling coverage by up to 60% over the baselines while keeping auto-labeling error under a fixed tolerance. The approach offers a scalable route to high-quality labeled data with less manual labeling, and it highlights the mismatch between calibration goals and TBAL objectives.

Abstract

Auto-labeling is an important family of techniques that produce labeled training sets with minimum manual labeling. A prominent variant, threshold-based auto-labeling (TBAL), works by finding a threshold on a model's confidence scores above which it can accurately label unlabeled data points. However, many models are known to produce overconfident scores, leading to poor TBAL performance. While a natural idea is to apply off-the-shelf calibration methods to alleviate the overconfidence issue, such methods still fall short. Rather than experimenting with ad-hoc choices of confidence functions, we propose a framework for studying the optimal TBAL confidence function. We develop a tractable version of the framework to obtain Colander (Confidence functions for Efficient and Reliable Auto-labeling), a new post-hoc method specifically designed to maximize performance in TBAL systems. We perform an extensive empirical evaluation of our method Colander and compare it against methods designed for calibration. Colander achieves up to 60% improvements on coverage over the baselines while maintaining auto-labeling error below 5% and using the same amount of labeled data as the baselines.
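
As a rough illustration of the thresholding step described in the abstract, the sketch below picks the smallest confidence threshold whose empirical validation error stays within a tolerance, then measures coverage on a set of unlabeled scores. It is a simplified plug-in version: the function names (`find_threshold`, `coverage`) and the toy numbers are assumptions made for this example, and the paper's exact estimation procedure may differ.

```python
import numpy as np

def find_threshold(val_conf, val_correct, eps):
    """Return the smallest threshold t such that the validation points with
    confidence >= t have empirical error <= eps (np.inf if none qualifies)."""
    val_conf = np.asarray(val_conf, dtype=float)
    val_correct = np.asarray(val_correct, dtype=bool)
    # np.unique returns candidates in ascending order, so the first
    # qualifying candidate gives the largest coverage.
    for t in np.unique(val_conf):
        selected = val_conf >= t
        error = 1.0 - val_correct[selected].mean()
        if error <= eps:
            return float(t)
    return float("inf")

def coverage(conf, t):
    """Fraction of points whose confidence clears the threshold."""
    return float(np.mean(np.asarray(conf, dtype=float) >= t))

# Toy validation data (hypothetical): the two least confident predictions are wrong.
val_conf = [0.9, 0.8, 0.7, 0.6, 0.5]
val_correct = [True, True, True, False, False]
t = find_threshold(val_conf, val_correct, eps=0.05)
print(t, coverage([0.95, 0.75, 0.65, 0.3], t))
```

If the model is overconfident, wrong predictions crowd into the high-score region, the qualifying threshold moves up, and coverage drops. That is the failure mode the abstract attributes to standard confidence scores.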

Paper Structure (32 sections, 12 equations, 8 figures, 7 tables, 2 algorithms)

Figures (8)

  • Figure 1: High-level diagram of an auto-labeling system. It takes unlabeled data as input and, with the help of expert labelers and ML models, outputs a labeled dataset.
  • Figure 2: Score distributions (kernel density estimates) of a CNN model trained on CIFAR-10 data: (a) softmax scores from the vanilla training procedure (SGD), (b) scores after post-hoc calibration using temperature scaling, and (c) scores from our Colander procedure applied to the same model. The CNN model is trained on 4000 randomly drawn points, and the number of validation points is 1000 (of which 500 are used for temperature scaling and Colander). The test accuracy of the model is 55%. Panels (d) and (e) show the coverage and auto-labeling error of these methods. The dotted red line corresponds to a 5% error threshold.
  • Figure 3: Threshold-based auto-labeling with Colander. Similar to the existing TBAL (Figure 1), it takes unlabeled data as input, selects a small subset of data points, and obtains human labels for them to create $D_{\mathrm{train}}^{(i)}$ and $D_{\mathrm{val}}^{(i)}$ for the $i$th iteration. It then trains model $\hat{h}_i$ on $D_{\mathrm{train}}^{(i)}$. In contrast to the standard TBAL procedure, here we randomly split $D_{\mathrm{val}}^{(i)}$ into two parts, $D_{\mathrm{cal}}^{(i)}$ and $D_{\mathrm{th}}^{(i)}$. Colander takes $\hat{h}_i$ and $D_{\mathrm{cal}}^{(i)}$ as input and learns a coverage-maximizing confidence function $\hat{g}_i$ for $\hat{h}_i$. Then, using $D_{\mathrm{th}}^{(i)}$ and $\hat{g}_i$, the auto-labeling thresholds $\hat{\mathbf{t}}_i$ are determined so that the auto-labeled data has error at most $\epsilon_a$. After the thresholds are obtained, the remaining steps are the same as in standard TBAL. The whole workflow runs in a loop until all the data is labeled or another stopping criterion is met.
  • Figure 4: Our choice of the function $g$.
  • Figure 5: Illustration of the tightness of surrogate error and coverage functions based on the choice of $\alpha$.
  • ...and 3 more figures
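
The workflow in the Figure 3 caption can be sketched for a single round as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `fit_confidence_placeholder` simply returns the maximum softmax probability where Colander would learn a confidence function from the calibration split, the threshold search uses plain empirical error, and every function name here is hypothetical.

```python
import numpy as np

def fit_confidence_placeholder(probs_cal, y_cal):
    """Stand-in for Colander (assumption): ignores the calibration data and
    returns max-softmax confidence. Colander would learn g from this split."""
    return lambda probs: probs.max(axis=1)

def per_class_thresholds(conf, pred, correct, num_classes, eps_a):
    """For each predicted class, the smallest threshold with error <= eps_a."""
    thresholds = np.full(num_classes, np.inf)  # inf means "never auto-label"
    for k in range(num_classes):
        mask = pred == k
        conf_k, correct_k = conf[mask], correct[mask]
        for t in np.unique(conf_k):  # ascending candidates
            if 1.0 - correct_k[conf_k >= t].mean() <= eps_a:
                thresholds[k] = t
                break
    return thresholds

def auto_label_round(probs_val, y_val, probs_unl, eps_a, seed=0):
    probs_val, y_val = np.asarray(probs_val), np.asarray(y_val)
    probs_unl = np.asarray(probs_unl)
    # Randomly split the validation data into calibration and threshold parts.
    idx = np.random.default_rng(seed).permutation(len(y_val))
    cal_idx, th_idx = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    g = fit_confidence_placeholder(probs_val[cal_idx], y_val[cal_idx])
    # Estimate per-class thresholds on the held-out threshold split.
    pred_th = probs_val[th_idx].argmax(axis=1)
    t = per_class_thresholds(g(probs_val[th_idx]), pred_th,
                             pred_th == y_val[th_idx],
                             probs_val.shape[1], eps_a)
    # Auto-label unlabeled points that clear their predicted class's threshold.
    pred_unl = probs_unl.argmax(axis=1)
    accept = g(probs_unl) >= t[pred_unl]
    return np.where(accept, pred_unl, -1), t  # -1 marks "send to a human"
```

Learning the confidence function and estimating the thresholds on separate splits keeps the threshold estimate from being fit to the same points that shaped the confidence function, which is the reason the caption gives for dividing the validation data.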