ScriptoriumWS: A Code Generation Assistant for Weak Supervision

Tzu-Heng Huang, Catherine Cao, Spencer Schoenberg, Harit Vishwakarma, Nicholas Roberts, Frederic Sala

TL;DR

ScriptoriumWS tackles the data labeling bottleneck by using code-generation models to synthesize labeling functions (LFs) for programmatic weak supervision (PWS), preserving the benefits of PWS while reducing human effort. It introduces a multi-tier prompting strategy and demonstrates that synthesized LFs can match human-designed LFs in accuracy while delivering substantially higher coverage on the WRENCH benchmark, with notable gains on the SMS and Spouse datasets and downstream F1 improvements. The system is designed to integrate with standard PWS pipelines and can complement hand-crafted LFs to further boost end-model performance. The work highlights practical implications for scalable, private, and cost-effective data labeling in real-world ML applications.

Abstract

Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts -- and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
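To make the pipeline described in the abstract concrete, here is a minimal sketch of how a few small labeling programs might vote on unlabeled text and how coverage could be measured. Everything in it is an illustrative assumption rather than the paper's code: the three labeling functions, the label constants, and the helper names are hypothetical, and plain majority vote stands in for the learned label model that weak supervision systems typically use for aggregation.

```python
from collections import Counter

# Hypothetical label constants for a spam task; -1 means "abstain".
ABSTAIN, HAM, SPAM = -1, 0, 1

# Three toy labeling functions (assumed for illustration, not from the paper).
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_asks_to_subscribe(text):
    return SPAM if "subscribe" in text.lower() else ABSTAIN

def lf_short_comment(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN

LFS = [lf_contains_link, lf_asks_to_subscribe, lf_short_comment]

def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(t) for lf in lfs] for t in texts]

def majority_vote(votes):
    """Aggregate one row of votes; abstain when no LF fires or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def coverage(label_matrix):
    """Fraction of examples on which at least one LF does not abstain."""
    covered = sum(any(v != ABSTAIN for v in row) for row in label_matrix)
    return covered / len(label_matrix)

texts = [
    "Check out http://example.com and subscribe!",
    "love this song",
    "This video reminds me of my childhood summers",
]
matrix = apply_lfs(texts)
pseudolabels = [majority_vote(row) for row in matrix]
```

In this toy run the third comment triggers no labeling function, so it receives no pseudolabel and coverage is two thirds. That uncovered fraction is the kind of gap that additional, automatically generated labeling functions are meant to close.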

Paper Structure

This paper contains 24 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed ScriptoriumWS system. Code generation models are prompted to produce small programs that act as weak supervision labeling functions. These are used within a weak supervision pipeline to label an unlabeled dataset. A downstream end model is trained on the labeled data.
  • Figure 2: An example of a synthesized LF using the general prompt strategy for the YouTube spam classification task.
  • Figure 3: An example of a synthesized LF using the human heuristic strategy for the YouTube spam classification task.
  • Figure 4: Two synthesized LF examples generated by adding label function examples (left) and data examples (right) for the YouTube spam classification task. The code generation model takes the given label function example as a reference and learns the relationship between the data examples and their expected outputs to extend it and synthesize its own program.
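The figure captions name four kinds of prompt information: a general task prompt, a human heuristic, a label function example, and data examples. As a rough sketch of how such tiers might be layered into a single prompt, consider the helper below. The function name `build_prompt`, its parameters, and the comment-style prompt layout are all assumptions made for illustration; the paper's actual prompt templates are not reproduced here.

```python
def build_prompt(task, heuristic=None, lf_example=None, data_examples=None):
    """Assemble a code-generation prompt from optional tiers of information.

    Hypothetical helper: the tier names follow the figure captions, but the
    exact wording and layout of the prompt are assumptions.
    """
    lines = [
        f"# Task: {task}",
        "# Write a Python labeling function that returns a class label, "
        "or -1 to abstain.",
    ]
    if heuristic:
        # Tier: a human-written heuristic expressed in natural language.
        lines.append(f"# Heuristic: {heuristic}")
    if lf_example:
        # Tier: an existing labeling function used as a reference.
        lines.append("# Example labeling function:")
        lines.append(lf_example.strip())
    if data_examples:
        # Tier: a few labeled data points showing inputs and expected outputs.
        lines.append("# Labeled examples:")
        lines.extend(f"# {text!r} -> {label}" for text, label in data_examples)
    # End with a function header for the model to complete.
    lines.append("def labeling_function(text):")
    return "\n".join(lines)

general_prompt = build_prompt("YouTube comment spam classification")
richer_prompt = build_prompt(
    "YouTube comment spam classification",
    heuristic="Comments that ask viewers to subscribe are usually spam.",
    data_examples=[("subscribe to my channel", 1)],
)
```

With only the task argument the helper yields a general prompt; each optional argument adds one more tier, which makes it easy to compare the labeling functions produced at each level of detail.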