Probably Approximately Correct Labels

Emmanuel J. Candès; Andrew Ilyas; Tijana Zrnic

TL;DR

The paper tackles the expensive process of obtaining high-quality labels by introducing PAC labeling, which guarantees that the average labeling error is at most $\epsilon$ with probability at least $1-\alpha$ while using cheap AI predictions for most samples. It develops a single-model core method that uses uncertainty scores to determine which instances require expert labels, and extends this to multi-model settings via a PAC router that learns to route data to the most cost-effective predictor. The authors integrate uncertainty calibration through multicalibration and derive differentiable routing with implicit gradients for end-to-end optimization, including a cost-sensitive variant that accounts for per-source costs. Empirically, PAC labeling achieves substantial budget savings across text, vision, and proteomics tasks while maintaining the required error guarantees, demonstrating a practical framework for cost-efficient, high-quality dataset curation using modern AI models.
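
In symbols, the guarantee described above can be written as follows. This is a sketch in notation assumed here rather than taken from the paper: $n$ is the number of data points, $Y_i$ is the expert label of point $i$, $\tilde{Y}_i$ is the label the procedure outputs (either the expert label or an AI prediction), and $\ell$ is a loss such as the 0-1 loss.

$$
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, \tilde{Y}_i) \le \epsilon \right) \ge 1 - \alpha
$$

Under this reading, and assuming $\ell(y, y) = 0$, points that receive an expert label contribute no loss, so the error budget $\epsilon$ is spent only on the points labeled by an AI model.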

Abstract

Obtaining high-quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
