Probably Approximately Correct Labels

Emmanuel J. Candès, Andrew Ilyas, Tijana Zrnic

TL;DR

The paper tackles the expensive process of obtaining high-quality labels by introducing PAC labeling, which guarantees that the average labeling error is at most $\epsilon$ with probability at least $1-\alpha$ while using cheap AI predictions for most samples. It develops a single-model core method that uses uncertainty scores to determine which instances require expert labels, and extends this to multi-model settings via a PAC router that learns to route data to the most cost-effective predictor. The authors integrate uncertainty calibration through multicalibration and derive differentiable routing with implicit gradients for end-to-end optimization, including a cost-sensitive variant that accounts for per-source costs. Empirically, PAC labeling achieves substantial budget savings across text, vision, and proteomics tasks while maintaining the required error guarantees, demonstrating a practical framework for cost-efficient, high-quality dataset curation using modern AI models.

Abstract

Obtaining high-quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
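Written out, the guarantee described above (average labeling error at most $\epsilon$ with probability at least $1-\alpha$) might take the following form. This is a sketch under assumed notation: $n$ is the dataset size, $\ell$ is a task-specific loss (for example, 0-1 loss for discrete labels), $Y_i$ is the expert label, and $\tilde{Y}_i$ is the label the procedure outputs. The symbols $n$ and $\ell$ do not appear in this summary and are introduced here only for illustration.

```latex
\[
  \mathbb{P}\!\left( \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i, \tilde{Y}_i\bigr) \le \epsilon \right) \ge 1 - \alpha
\]
```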

Paper Structure

This paper contains 20 sections, 2 theorems, 17 equations, 6 figures, 7 tables, 2 algorithms.

Key Result

Theorem 2.1

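The labels $\tilde{Y}_i = Y_i \mathbf{1}\{U_i \geq \hat{u}\} + \hat{Y}_i \mathbf{1}\{U_i < \hat{u}\}$, with the threshold $\hat{u}$ given by the paper's equation labeled eq:uhat, are PAC labels; that is, they satisfy the labeling guarantee in the equation labeled eq:labeling_guarantee.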
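The label rule in Theorem 2.1 can be illustrated with a minimal Python sketch. It takes the threshold `u_hat` as given (the paper's formula for $\hat{u}$ is not reproduced in this summary, so the value below is arbitrary), and it assumes 0-1 loss when measuring error. The function names and toy data are hypothetical. For illustration the full list of expert labels is passed in; in practice only the labels with $U_i \geq \hat{u}$ would be collected.

```python
def compose_labels(expert, ai, uncertainty, u_hat):
    """Apply the rule from Theorem 2.1: use the expert label Y_i when
    U_i >= u_hat and the AI prediction Y_hat_i otherwise.

    Returns the composed labels and the number of expert labels used.
    """
    labels = []
    n_expert = 0
    for y, y_hat, u in zip(expert, ai, uncertainty):
        if u >= u_hat:
            labels.append(y)      # uncertain point: ask the expert
            n_expert += 1
        else:
            labels.append(y_hat)  # confident point: keep the AI label
    return labels, n_expert


def realized_error(expert, labels):
    """Average 0-1 error of the composed labels against the expert labels
    (0-1 loss is an assumption made for this sketch)."""
    return sum(y != t for y, t in zip(expert, labels)) / len(expert)


# Toy data (hypothetical): five points, arbitrary threshold.
expert = [1, 0, 1, 1, 0]
ai = [1, 1, 1, 0, 0]
uncertainty = [0.1, 0.8, 0.2, 0.3, 0.05]
u_hat = 0.5  # assumed given; not computed by the paper's procedure here

labels, n_expert = compose_labels(expert, ai, uncertainty, u_hat)
error = realized_error(expert, labels)
budget_saved = 1 - n_expert / len(expert)
print(labels, n_expert, error, budget_saved)
```

In this toy run only the second point is sent to the expert, so one AI mistake (the fourth point) remains in the composed labels. Whether that residual error stays below $\epsilon$ with the required probability depends on how $\hat{u}$ is chosen, which is the part this sketch leaves out.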

Figures (6)

  • Figure 1: Illustration of PAC labeling. The procedure estimates an uncertainty threshold $\hat{u}$ and collects expert labels for all points where $U_i \geq \hat{u}$.
  • Figure 2: PAC labeling for discrete labels. Realized error and save in budget for PAC labeling, the naive thresholding baseline, and the AI only baseline. Each row and column correspond to a different dataset and value of $\epsilon$ (denoted by vertical dashed line), respectively. For PAC labeling, we plot the realized error and save in budget for $50$ randomly chosen trials. For the naive thresholding baseline, we collect expert labels for all points with $U_i \geq \epsilon$.
  • Figure 3: PAC labeling for continuous labels. Realized error and save in budget for PAC labeling and the AI only baseline. Each row and column correspond to a different dataset and value of $\epsilon$ (denoted by vertical dashed line), respectively. For PAC labeling, we plot the realized error and save in budget for $50$ randomly chosen trials.
  • Figure 4: PAC router for language models. Realized error and save in budget for PAC labeling with GPT, PAC labeling with Claude, and the PAC router between GPT and Claude. The top row corresponds to the costless setting; the bottom row corresponds to the cost-sensitive setting. Each column corresponds to a different value of $\epsilon$ (denoted by vertical dashed line). For each method, we plot the realized error and save in budget for $50$ randomly chosen trials.
  • Figure 5: Loss $L^u$ after PAC routing. Error $L^u$ after collecting labels at uncertainties greater than or equal to $u$, as a function of $u$, for GPT and Claude individually and the PAC router. We observe that the router achieves a lower error $L^u$ than the individual baselines, for all $u$.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Theorem 2.1
  • Proof
  • Corollary 2.1