ActiveDP: Bridging Active Learning and Data Programming
Naiqing Guan, Nick Koudas
TL;DR
ActiveDP is an interactive framework that bridges data programming (DP) and active learning (AL) to produce labels with both high accuracy and broad coverage. It introduces three components: the ADP sampler for balanced query selection, LabelPick for efficient pruning of labeling functions (LFs) via a Markov blanket approach, and ConFusion for confidence-based label aggregation that draws on both DP and AL signals. Empirical results on textual and tabular datasets show that ActiveDP consistently outperforms state-of-the-art weak supervision and active learning baselines across diverse labeling budgets and remains robust under label noise. The work highlights the practical value of combining weak supervision with instance-level labeling and shows that the resulting labels improve downstream classifiers in realistic labeling scenarios.
Abstract
Modern machine learning models require large labelled datasets to achieve good performance, but manually labelling large datasets is expensive and time-consuming. The data programming paradigm enables users to label large datasets efficiently but produces noisy labels, which degrade the downstream model's performance. The active learning paradigm, on the other hand, can acquire accurate labels but only for a small fraction of instances. In this paper, we propose ActiveDP, an interactive framework that bridges active learning and data programming to generate labels with both high accuracy and high coverage, combining the strengths of both paradigms. Experiments show that ActiveDP outperforms previous weak supervision and active learning approaches and performs consistently well under different labelling budgets.
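To make the idea of combining the two label sources concrete, the following is a minimal sketch of one way confidence-based aggregation could work. It is an illustration under stated assumptions, not the paper's ConFusion procedure: the function name `fuse_labels`, the fixed `threshold` value, and the `ABSTAIN` marker are all hypothetical. The assumed rule is to trust the active learning model when it is confident, fall back to the data programming label model otherwise, and abstain when no labeling function covers the instance.

```python
from typing import List, Optional, Sequence

ABSTAIN = -1  # hypothetical marker meaning "no label produced"


def argmax(probs: Sequence[float]) -> int:
    """Return the index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def fuse_labels(
    al_probs: Sequence[Sequence[float]],
    dp_probs: Sequence[Optional[Sequence[float]]],
    threshold: float = 0.8,  # assumed value; in practice it would be tuned
) -> List[int]:
    """Combine active-learning and data-programming predictions.

    al_probs: class probabilities from a model trained on queried labels.
    dp_probs: class probabilities from a label model over labeling
              functions, or None where no labeling function fires.
    """
    fused = []
    for al_p, dp_p in zip(al_probs, dp_probs):
        if max(al_p) >= threshold:
            # The AL model is confident: use its prediction.
            fused.append(argmax(al_p))
        elif dp_p is not None:
            # Otherwise fall back to the weak-supervision label.
            fused.append(argmax(dp_p))
        else:
            # Neither source is usable: leave the instance unlabelled.
            fused.append(ABSTAIN)
    return fused


al_probs = [[0.95, 0.05], [0.55, 0.45], [0.60, 0.40]]
dp_probs = [[0.30, 0.70], [0.20, 0.80], None]
labels = fuse_labels(al_probs, dp_probs)
```

In this toy input, the first instance takes the active learning prediction, the second falls back to the label model, and the third is left unlabelled. A higher threshold shifts weight toward the weak-supervision labels, while a lower one relies more on the model trained from queried instances.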
