
FAIR: Filtering of Automatically Induced Rules

Divya Jyoti Bajpai, Ayush Maheshwari, Manjesh Kumar Hanawal, Ganesh Ramakrishnan

TL;DR

Fair tackles the problem of selecting a high-quality, diverse subset of automatically induced labeling rules for weak supervision in text classification. It frames rule selection as a submodular optimization problem, introducing the Graph-Cut objective $f_{GC}$ to capture precision, coverage, and inter-rule agreement, and compares it with a non-submodular $f_{PCA}$ baseline. Through extensive experiments on five datasets and multiple semi-supervised aggregators, Fair GC consistently improves end-model macro-F1 scores and achieves statistically significant gains over existing ARI-filtering approaches. The method highlights the importance of modeling interdependencies among labeling functions to reduce label noise and enhance downstream performance, while acknowledging limitations related to noisier rule pools and computational costs.

Abstract

The availability of large annotated datasets can be a critical bottleneck in training machine learning algorithms successfully, especially when they are applied to diverse domains. Weak supervision offers a promising alternative by accelerating the creation of labeled training data using domain-specific rules. However, it requires users to write a diverse set of high-quality rules to assign labels to the unlabeled data. Automatic Rule Induction (ARI) approaches circumvent this problem by automatically creating rules from features on a small labeled set and filtering a final set of rules from them. In the ARI approach, the crucial step is to select a high-quality, useful subset of rules from the large set of automatically created rules. In this paper, we propose an algorithm, Fair (Filtering of Automatically Induced Rules), to filter rules from a large number of automatically induced rules using submodular objective functions that account for the collective precision, coverage, and conflicts of the rule set. We experiment with three ARI approaches and five text classification datasets to validate the superior performance of our algorithm with respect to several semi-supervised label aggregation approaches. Further, we show that Fair achieves statistically significant results in comparison to existing rule-filtering approaches.
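
To make the idea of filtering rules with a submodular objective concrete, the following is a minimal sketch of greedy selection under the generic graph-cut function $f(S) = \sum_{i \in V, j \in S} s_{ij} - \lambda \sum_{i, j \in S} s_{ij}$. It is not the paper's exact $f_{GC}$, whose definition is not given in this section. The similarity matrix `sim`, the trade-off parameter `lam`, and the function names are all assumptions made for illustration; in practice the pairwise scores would have to encode precision, coverage, and agreement between rules.

```python
import numpy as np


def graph_cut_value(sim, selected, lam):
    """Generic graph-cut value of a rule subset (illustrative, not the paper's f_GC).

    sim      : (n, n) symmetric matrix of hypothetical pairwise rule scores
    selected : list of indices of the chosen rules
    lam      : weight of the redundancy penalty inside the chosen set
    """
    if not selected:
        return 0.0
    idx = list(selected)
    # How well the chosen rules represent the whole candidate pool.
    representation = sim[:, idx].sum()
    # How similar the chosen rules are to one another.
    redundancy = sim[np.ix_(idx, idx)].sum()
    return float(representation - lam * redundancy)


def greedy_select(sim, k, lam=0.5):
    """Pick up to k rules by repeatedly adding the rule with the largest marginal gain."""
    n = sim.shape[0]
    selected = []
    current = 0.0
    for _ in range(min(k, n)):
        best_gain, best_j = None, None
        for j in range(n):
            if j in selected:
                continue
            gain = graph_cut_value(sim, selected + [j], lam) - current
            if best_gain is None or gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        current += best_gain
    return selected


# Toy pool of four candidate rules; rules 0 and 1 are near-duplicates.
sim = np.array([
    [1.00, 0.90, 0.20, 0.05],
    [0.90, 1.00, 0.10, 0.05],
    [0.20, 0.10, 1.00, 0.20],
    [0.05, 0.05, 0.20, 1.00],
])

print(greedy_select(sim, k=2, lam=0.5))
print(greedy_select(sim, k=2, lam=0.0))
```

With the redundancy penalty switched off (`lam=0.0`), the greedy procedure keeps both near-duplicate rules; with `lam=0.5` it swaps the second one for a less similar rule, which is the kind of interdependency between rules that the filtering step is meant to account for.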

Paper Structure (19 sections, 3 equations, 8 figures, 12 tables, 2 algorithms)

Figures (8)

  • Figure 1: The flow of our approach. We first generate rules in the candidate rule generation block and then filter them using either the respective existing approaches (Snuba, Grasp, and Classifier weights) or Fair. The final committed rule set is passed on to the semi-supervised label aggregation approaches to obtain the final performance on the downstream task.
  • Figure 2: Results on the IMDB and TREC datasets for Fair GC against Snuba and M-Grasp.
  • Figure 3: Comparison of Fair against Snuba and M-Grasp filtering over different label aggregation approaches; GC denotes Fair Graph-Cut. The size of the final committed set is the same across all ARI approaches.
  • Figure 4: Results on the YouTube dataset for Fair GC.
  • Figure 5: Comparison of Fair GC against Classifier weights on different datasets.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Example A.1