Weak Supervision Performance Evaluation via Partial Identification

Felipe Maia Polo, Subha Maity, Mikhail Yurochkin, Moulinath Banerjee, Yuekai Sun

TL;DR

This work frames model evaluation under weak supervision as a partial identification problem and estimates performance bounds using Fréchet bounds. The approach yields reliable bounds on key metrics without requiring labeled data, overcoming core limitations of current weak supervision evaluation techniques.

Abstract

Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels, utilizing weak labels from heuristics, crowdsourcing, or pre-trained models. However, the absence of ground truth complicates model evaluation, as traditional metrics such as accuracy, precision, and recall cannot be directly calculated. In this work, we present a novel method to address this challenge by framing model evaluation as a partial identification problem and estimating performance bounds using Fréchet bounds. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques. Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications.
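
To make the idea of a Fréchet bound concrete, here is a minimal sketch of the classical, unconditional case. It is not the paper's method: the abstract describes bounds computed through convex optimization, while this example only applies the textbook Fréchet inequalities $\max(0, p + q - 1) \le P(A \cap B) \le \min(p, q)$ to a binary classifier whose joint distribution with the true label is unknown. The function names and the example marginals are hypothetical, chosen for illustration.

```python
def frechet_joint_bounds(p: float, q: float) -> tuple[float, float]:
    """Classical Frechet bounds on P(A and B) given only P(A)=p and P(B)=q."""
    return max(0.0, p + q - 1.0), min(p, q)


def accuracy_bounds(p_pred_pos: float, p_true_pos: float) -> tuple[float, float]:
    """Bounds on binary accuracy when only the two marginals are known.

    With t = P(pred=1, true=1), accuracy is
        P(pred=1, true=1) + P(pred=0, true=0) = 1 - p - q + 2t,
    which is increasing in t, so the Frechet bounds on t carry over directly.
    """
    lo_joint, hi_joint = frechet_joint_bounds(p_pred_pos, p_true_pos)
    base = 1.0 - p_pred_pos - p_true_pos
    return base + 2.0 * lo_joint, base + 2.0 * hi_joint


# Hypothetical marginals: the classifier predicts positive 30% of the time,
# and 40% of examples are truly positive.
lower, upper = accuracy_bounds(0.3, 0.4)
print(f"accuracy lies in [{lower:.2f}, {upper:.2f}]")
```

The interval is wide because only the marginals are used. Extra information about the label, such as the conditional distribution $P_{Y\mid Z}$ given weak labels mentioned in the figure captions, is what can narrow bounds of this kind.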

Paper Structure

This paper contains 34 sections, 11 theorems, 51 equations, 6 figures, and 5 tables.

Key Result

Theorem 2.1

Let $g:\mathcal{X} \times \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$ be a bounded measurable function. Then, … where … Moreover, $L$ and $U$ are attained by some optimizers in $\mathcal{A}$.

Figures (6)

  • Figure 1: We apply our method to bound test metrics such as accuracy and F1 score (in green) when no true labels are used to estimate performance. In the first row ("Oracle"), we use true labels to estimate the conditional distribution $P_{Y\mid Z}$, thus approximating a scenario in which the label model is reasonably specified. In the second row ("Snorkel"), we use a label model to estimate $P_{Y\mid Z}$ without access to any true labels. Despite potential misspecification in Snorkel's label model, it performs comparably to using labels to estimate $P_{Y\mid Z}$, giving approximate but meaningful bounds.
  • Figure 2: Precision and recall bounds for hate speech detection.
  • Figure 3: Performance bounds for classifiers on the YouTube dataset, initially relying solely on few-shot weak labels obtained via prompts to the LLM Llama-2-13b-chat-hf. The progression of plots illustrates the comparative impact of integrating "high-quality" labels from Wrench versus synthetically generated "low-quality" labels. Evidently, the addition of "high-quality" labels significantly enhances the bounds, underscoring their superior utility over "low-quality" labels for optimal classification of SPAM and HAM comments.
  • Figure 4: Bounds on classifier accuracies across classification thresholds for the Wrench datasets. Despite potential misspecification in Snorkel's and FlyingSquid's label models, they perform comparably to using labels to estimate $P_{Y\mid Z}$, giving approximate but meaningful bounds.
  • Figure 5: Bounds on classifier accuracies and F1 scores across classification thresholds for the Wrench datasets (using the full set of weak labels). Despite potential misspecification in Snorkel's and FlyingSquid's label models, they perform comparably to using labels to estimate $P_{Y\mid Z}$, giving approximate but meaningful bounds.
  • ...and 1 more figure

Theorems & Definitions (22)

  • Theorem 2.1
  • Theorem 2.5
  • Corollary 3.1
  • Theorem 5.1
  • Theorem 5.2
  • proof: Proof of Theorem \ref{thm:dual_problem}
  • proof: Proof of Theorem \ref{thm:clt}
  • proof: Proof of Corollary \ref{cor:precision-recall}
  • Lemma C.1
  • proof
  • ...and 12 more