ScriptoriumWS: A Code Generation Assistant for Weak Supervision
Tzu-Heng Huang, Catherine Cao, Spencer Schoenberg, Harit Vishwakarma, Nicholas Roberts, Frederic Sala
TL;DR
ScriptoriumWS tackles the data labeling bottleneck by using code-generation models to synthesize labeling functions (LFs) for programmatic weak supervision (PWS), preserving the benefits of PWS while reducing human effort. It introduces a multi-tier prompting strategy and demonstrates that synthesized LFs can match human-designed LFs in accuracy while delivering substantially higher coverage on the WRENCH benchmark, with notable gains on the SMS and Spouse datasets and improvements in downstream F1. The system is designed to integrate with standard PWS pipelines and can complement hand-crafted LFs to further boost end-model performance. The work highlights practical implications for scalable, private, and cost-effective data labeling in real-world ML applications.
Abstract
Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts -- and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
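To make the setup described in the abstract concrete, here is a minimal sketch of weak supervision sources written as small programs, with their votes aggregated into pseudolabels. Everything in it is an illustrative assumption rather than the paper's method: the three labeling functions are hypothetical hand-written heuristics for a spam-style task (not LFs generated by ScriptoriumWS), the example texts are made up, and a simple majority vote stands in for the learned label model that weak supervision pipelines typically use. Coverage is computed here as the fraction of examples on which at least one source does not abstain.

```python
from collections import Counter

# Label convention assumed for this sketch: -1 means the source abstains.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Hypothetical labeling functions: each is a small, noisy heuristic program.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_url(text):
    lowered = text.lower()
    return SPAM if "http" in lowered or "www." in lowered else ABSTAIN

def lf_short_personal(text):
    words = text.lower().split()
    is_personal = "i" in words or "you" in words
    return HAM if len(words) <= 6 and is_personal else ABSTAIN

LFS = [lf_contains_free, lf_contains_url, lf_short_personal]

def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per example, one column per source."""
    return [[lf(t) for lf in lfs] for t in texts]

def majority_vote(votes):
    """Aggregate one row of votes; abstain when no source votes or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    top = counts.most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]

def coverage(label_matrix):
    """Fraction of examples labeled by at least one source."""
    covered = sum(any(v != ABSTAIN for v in row) for row in label_matrix)
    return covered / len(label_matrix)

# Made-up example texts, for illustration only.
texts = [
    "Claim your FREE prize at http://example.com now",
    "are you home yet",
    "Meeting moved to 3pm tomorrow in the main conference room",
]
label_matrix = apply_lfs(texts)
pseudolabels = [majority_vote(row) for row in label_matrix]
```

In this toy run the third message is left unlabeled because every source abstains on it, which is the kind of gap that higher-coverage sources are meant to close.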
