ActiveDP: Bridging Active Learning and Data Programming

Naiqing Guan, Nick Koudas

TL;DR

ActiveDP is an interactive framework that combines data programming (DP) and active learning (AL) to produce labels with both high accuracy and broad coverage. It introduces three components: the ADP sampler for balanced query selection, LabelPick for efficient pruning of labeling functions (LFs) via a Markov blanket approach, and ConFusion for confidence-based label aggregation that draws on both DP and AL signals. Empirical results on textual and tabular datasets show that ActiveDP consistently outperforms state-of-the-art weak supervision and active learning baselines across diverse labeling budgets and remains robust under label noise. The work highlights the practical value of combining weak supervision with instance-level labeling and demonstrates scalable improvements for downstream classifiers in real-world labeling scenarios.
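
To make the idea of confidence-based label aggregation concrete, here is a minimal sketch. It is not the ConFusion procedure itself, whose details are not given in this summary. It assumes a simple rule: trust the active learning model when its top class probability reaches a threshold, otherwise fall back to the data programming label. The function name `fuse_labels`, the `ABSTAIN` marker, and the default threshold of 0.8 are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

ABSTAIN = -1  # hypothetical marker meaning "no label available"


def fuse_labels(
    al_probs: Sequence[Sequence[float]],
    dp_labels: Sequence[int],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[int]:
    """Combine AL-model probabilities with DP labels by confidence.

    al_probs[i] holds the AL model's class probabilities for instance i.
    dp_labels[i] is the aggregated DP label for instance i, or ABSTAIN.
    """
    fused = []
    for probs, dp_label in zip(al_probs, dp_labels):
        # Class with the highest AL-model probability.
        al_label = max(range(len(probs)), key=probs.__getitem__)
        if probs[al_label] >= threshold:
            # The AL model is confident enough: use its prediction.
            fused.append(al_label)
        elif dp_label != ABSTAIN:
            # Otherwise fall back to the data programming label.
            fused.append(dp_label)
        else:
            # Neither source gives a usable label.
            fused.append(ABSTAIN)
    return fused


example = fuse_labels(
    al_probs=[[0.9, 0.1], [0.55, 0.45], [0.6, 0.4]],
    dp_labels=[1, 1, ABSTAIN],
)
```

In this sketch the first instance takes the AL prediction, the second takes the DP label because the AL model is not confident, and the third stays unlabeled. In practice the threshold would presumably be tuned on held-out data rather than fixed.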

Abstract

Modern machine learning models require large labelled datasets to achieve good performance, but manually labelling large datasets is expensive and time-consuming. The data programming paradigm enables users to label large datasets efficiently but produces noisy labels, which degrade the downstream model's performance. The active learning paradigm, on the other hand, can acquire accurate labels but only for a small fraction of instances. In this paper, we propose ActiveDP, an interactive framework that bridges active learning and data programming to generate labels with both high accuracy and high coverage, combining the strengths of both paradigms. Experiments show that ActiveDP outperforms previous weak supervision and active learning approaches and performs consistently well under different labelling budgets.

Paper Structure

This paper contains 21 sections, 3 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Workflow of ActiveDP. Left: iterative LF creation at training phase. Right: label aggregation at inference phase.
  • Figure 2: Workflow of the LF selection module in ActiveDP.
  • Figure 3: End-to-end performance comparison between ActiveDP and baseline methods.