The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators

Tzu-Heng Huang, Catherine Cao, Vaishnavi Bhargava, Frederic Sala

TL;DR

The paper tackles the cost and auditability barriers of using large pretrained models for data annotation by proposing Alchemist, a system that prompts language models to generate labeling programs rather than directly output labels. These programs run locally, can be inspected and extended, and are combined via weak supervision (Snorkel) to produce high-quality pseudolabels, which are then used to train a distilled model. Empirically, Alchemist achieves similar or better accuracy than LLM-based annotation on multiple text datasets while reducing labeling costs by about 500x and improving average performance by roughly 12.9%. The authors extend the approach to multimodal data by extracting high-level concepts with LLMs and using local multimodal features (e.g., CLIP) to generate labeling programs, and they demonstrate robustness gains from supplementary information and program diversity, as well as improvements over human-crafted labeling functions in several tasks.
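The pipeline summarized above can be illustrated with a minimal sketch. It assumes three hand-written stand-ins for generated labeling programs (`lf_regex_links`, `lf_keywords`, and `lf_short_comment` are hypothetical names, not programs from the paper) and replaces the Snorkel label model with a plain majority vote, which is a simpler aggregation than the one the paper uses.

```python
import re
from collections import Counter

# Label convention assumed for this sketch: -1 means the program abstains.
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_regex_links(text: str) -> int:
    """Hypothetical program: vote SPAM when the text contains a URL."""
    return SPAM if re.search(r"https?://|www\.", text, re.IGNORECASE) else ABSTAIN


def lf_keywords(text: str) -> int:
    """Hypothetical program: vote SPAM on promotional keywords."""
    keywords = ("subscribe", "free", "click here")
    lowered = text.lower()
    return SPAM if any(k in lowered for k in keywords) else ABSTAIN


def lf_short_comment(text: str) -> int:
    """Hypothetical program: vote HAM for short comments without links."""
    if "http" in text.lower():
        return ABSTAIN
    return HAM if len(text.split()) <= 5 else ABSTAIN


PROGRAMS = [lf_regex_links, lf_keywords, lf_short_comment]


def majority_vote(votes):
    """Aggregate votes, ignoring abstentions; ties and empty votes abstain.

    This is a stand-in for a learned label model such as Snorkel's,
    which additionally estimates how accurate each program is.
    """
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def pseudolabel(texts, programs):
    """Run every program on every text and aggregate into one pseudolabel each."""
    return [majority_vote([program(t) for program in programs]) for t in texts]
```

The resulting pseudolabels would then serve as training targets for a smaller distilled model; examples on which every program abstains are typically left out of that training set.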

Abstract

Large pretrained models can be used as annotators, helping replace or augment crowdworkers and enabling distilling generalist models into smaller specialist models. Unfortunately, this comes at a cost: employing top-of-the-line models often requires paying thousands of dollars for API calls, while the resulting datasets are static and challenging to audit. To address these challenges, we propose a simple alternative: rather than directly querying labels from pretrained models, we task models to generate programs that can produce labels. These programs can be stored and applied locally, re-used and extended, and cost orders of magnitude less. Our system, Alchemist, obtains comparable to or better performance than large language model-based annotation in a range of tasks for a fraction of the cost: on average, improvements amount to a 12.9% enhancement while the total labeling costs across all datasets are reduced by a factor of approximately 500x.
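The cost argument in the abstract can be made concrete with a rough model: direct annotation pays for one API call per example, whereas program generation pays for one call per program, independent of dataset size. The sketch below assumes token-based pricing; the function names and every number in the usage note are hypothetical placeholders, not figures from the paper.

```python
def direct_annotation_cost(n_examples, tokens_per_example, price_per_1k_tokens):
    """API cost when the model labels every example itself."""
    return n_examples * tokens_per_example / 1000 * price_per_1k_tokens


def program_generation_cost(n_programs, tokens_per_program, price_per_1k_tokens):
    """API cost when the model only writes labeling programs.

    The programs run locally afterwards, so the dataset size does not
    appear in this expression.
    """
    return n_programs * tokens_per_program / 1000 * price_per_1k_tokens
```

For instance, with placeholder values of 100,000 examples at 500 tokens each versus 10 programs at 1,000 tokens each, the first cost grows linearly if the dataset doubles while the second stays fixed. The actual savings reported in the paper depend on its datasets and pricing, which this sketch does not reproduce.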

Paper Structure (23 sections, 5 figures, 12 tables)

This paper contains 23 sections, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Examples of generated programs and their prompts. These are synthesized by GPT-4 for spam detection and cancer identification tasks. Programs use regular expressions (left program) and keyword matching (right program) as their labeling logic to classify data points.
  • Figure 2: Overall workflow for Alchemist.
  • Figure 3: Alchemist can handle rich modalities through a simple extension. First, a language model identifies task-specific concepts (top). Then, a local multimodal model is used as a feature extractor for these concepts, producing low-dimensional feature vectors that can be ingested by generated labeling programs.
  • Figure 4: Program examples generated by GPT-4o on the Waterbirds dataset. The left program is synthesized by directly asking for a labeling program when the input is an image (raw pixels), while the right program uses Alchemist's extension. The former labels birds using the dominant color in the image, which can yield incorrect predictions due to spurious correlations (e.g., background).
  • Figure 5: Performance is reported as the mean and standard deviation. Results indicate that the label model improves as the number of diverse programs increases.
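The multimodal extension described in the Figure 3 caption can be sketched as follows. The sketch assumes that concept and image embeddings are already available (in practice a local model such as CLIP would supply them); the concept names, the toy three-dimensional vectors used in testing, and the decision rule in `label_bird` are all hypothetical and chosen only for illustration.

```python
import math

# Label convention assumed for this sketch.
LANDBIRD, WATERBIRD = 0, 1


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def concept_features(image_embedding, concept_embeddings):
    """Map an image embedding to a low-dimensional vector of concept scores."""
    return {
        name: cosine(image_embedding, embedding)
        for name, embedding in concept_embeddings.items()
    }


def label_bird(features):
    """Hypothetical labeling program operating on concept scores, not pixels."""
    water_score = max(features["webbed feet"], features["long bill"])
    land_score = max(features["short beak"], features["perching feet"])
    return WATERBIRD if water_score > land_score else LANDBIRD
```

Because the program sees only scores for bird-related concepts, it has no direct access to background pixels, which is the failure mode the Figure 4 caption attributes to the raw-pixel program.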