Leveraging Large Language Models for Structure Learning in Prompted Weak Supervision

Jinyan Su, Peilin Yu, Jieyu Zhang, Stephen H. Bach

TL;DR

This work introduces the Structure Refining Module to enhance Prompted Weak Supervision by learning intrinsic dependencies among prompted LFs via LLM embeddings. It comprises LaRe, which removes redundant prompts, and CosGen, which constructs a sparse dependency graph used by the label model, all without requiring labeled data. Empirical results across three benchmarks show substantial gains over PromptedWS and competitive performance versus data-driven structure learning, with notable efficiency benefits. The approach demonstrates how embedding-based LF similarity can robustly identify correlations, enabling more accurate and scalable weak supervision pipelines using large language models.

Abstract

Prompted weak supervision (PromptedWS) applies pre-trained large language models (LLMs) as the basis for labeling functions (LFs) in a weak supervision framework to obtain large labeled datasets. We further extend the use of LLMs in the loop to address one of the key challenges in weak supervision: learning the statistical dependency structure among supervision sources. In this work, we ask the LLM how similar these prompted LFs are. We propose a Structure Refining Module, a simple yet effective first approach that uses the similarities of the prompts by taking advantage of the intrinsic structure in the embedding space. At the core of the Structure Refining Module are Labeling Function Removal (LaRe) and Correlation Structure Generation (CosGen). Compared to previous methods that learn the dependencies from weak labels, our method finds dependencies that are intrinsic to the LFs and less dependent on the data. We show that our Structure Refining Module improves the PromptedWS pipeline by up to 12.7 points on the benchmark tasks. We also explore the trade-offs between efficiency and performance with comprehensive ablation experiments and analysis. Code for this project can be found at https://github.com/BatsResearch/su-bigdata23-code.
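To make the idea of embedding-based structure learning concrete, the following is a minimal sketch and not the authors' implementation. It assumes each prompted LF has already been mapped to an embedding vector, and it uses cosine similarity, a greedy pair-based removal rule, and a fixed similarity threshold; all three are illustrative assumptions, and the function names (`cosine_similarity_matrix`, `remove_redundant`, `dependency_edges`) are hypothetical. The first removal step stands in for the role LaRe plays, and the thresholded edge list stands in for the sparse dependency graph that CosGen passes to the label model.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def remove_redundant(sim: np.ndarray, n_remove: int) -> list:
    """Greedy stand-in for LF removal (an assumption, not the paper's rule):
    repeatedly find the most similar remaining pair and drop one of its members."""
    keep = list(range(sim.shape[0]))
    for _ in range(min(n_remove, len(keep) - 1)):
        sub = sim[np.ix_(keep, keep)].copy()
        np.fill_diagonal(sub, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        keep.pop(int(max(i, j)))  # arbitrary tie-break: drop the later LF
    return keep


def dependency_edges(sim: np.ndarray, keep: list, threshold: float) -> list:
    """Sparse dependency graph: connect kept LFs whose similarity meets the threshold."""
    edges = []
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            if sim[keep[a], keep[b]] >= threshold:
                edges.append((keep[a], keep[b]))
    return edges


# Toy embeddings for four hypothetical prompted LFs; LF 1 nearly duplicates LF 0.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])
toy_sim = cosine_similarity_matrix(toy_embeddings)
toy_keep = remove_redundant(toy_sim, n_remove=1)
toy_edges = dependency_edges(toy_sim, toy_keep, threshold=0.5)
```

Neither step touches weak labels or gold labels, which is the property the abstract emphasizes: the structure is derived from the prompts alone.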

Paper Structure (13 sections, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: Prompted weak supervision workflow showing the Structure Refining Module as a plugin.
  • Figure 2: Visualization of the similarity matrix for labeling functions in the YouTube dataset compared with double faults, i.e., examples on which both labeling functions make a mistake.
  • Figure 3: Performance at different removal rates. The dashed lines show prompted weak supervision without any removal. We plot the results of removing 10%, 30%, 50%, and 70% of the labeling functions.
  • Figure 4: Label model running time for LaRe (left of the dashed vertical line; plotted in blue) and CosGen (right of the dashed vertical line; plotted in orange). We zoom in on the LaRe results (blue plots) in the subplots to better show the trend. From left to right, the columns show YouTube, SMS, and Spouse, respectively. The first row shows experiments with the original set of prompted LFs, and the second row shows experiments with augmented LFs.
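The double faults referenced in the Figure 2 caption can be counted with a short computation. The sketch below is illustrative only: it assumes a vote matrix with one row per example and one column per LF, an abstain value of -1, and that abstentions do not count as mistakes. The name `double_fault_matrix` and these conventions are assumptions rather than details taken from the paper.

```python
import numpy as np

ABSTAIN = -1  # assumed encoding for "LF did not vote"


def double_fault_matrix(votes: np.ndarray, gold: np.ndarray) -> np.ndarray:
    """Entry (i, j) counts examples on which LF i and LF j both voted incorrectly.

    votes: (n_examples, n_lfs) integer array of LF outputs.
    gold:  (n_examples,) integer array of true labels.
    The diagonal holds each LF's individual error count.
    """
    wrong = (votes != ABSTAIN) & (votes != gold[:, None])
    w = wrong.astype(int)
    return w.T @ w


# Toy data: three examples, three hypothetical LFs.
toy_votes = np.array([
    [1, 1, 0],
    [0, 0, ABSTAIN],
    [1, 0, 1],
])
toy_gold = np.array([0, 1, 1])
toy_faults = double_fault_matrix(toy_votes, toy_gold)
```

Unlike the embedding similarity matrix, this quantity requires gold labels, so it serves as a reference for comparison rather than as an input to the label-free method.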