NLPGuard: A Framework for Mitigating the Use of Protected Attributes by NLP Classifiers

Salvatore Greco, Ke Zhou, Licia Capra, Tania Cerquitelli, Daniele Quercia

TL;DR

NLPGuard is a framework for mitigating the reliance of NLP classifiers on protected attributes. It reduces this reliance by up to 79% while slightly improving accuracy.

Abstract

AI regulations are expected to prohibit machine learning models from using sensitive attributes during training. However, the latest Natural Language Processing (NLP) classifiers, which rely on deep learning, operate as black-box systems, complicating the detection and remediation of such misuse. Traditional bias mitigation methods in NLP aim for comparable performance across different groups based on attributes like gender or race but fail to address the underlying issue of reliance on protected attributes. To partly fix that, we introduce NLPGuard, a framework for mitigating the reliance on protected attributes in NLP classifiers. NLPGuard takes an unlabeled dataset, an existing NLP classifier, and its training data as input, producing a modified training dataset that significantly reduces dependence on protected attributes without compromising accuracy. NLPGuard is applied to three classification tasks: identifying toxic language, sentiment analysis, and occupation classification. Our evaluation shows that current NLP classifiers heavily depend on protected attributes, with up to $23\%$ of the most predictive words associated with these attributes. However, NLPGuard effectively reduces this reliance by up to $79\%$, while slightly improving accuracy.
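
To make the input/output contract in the abstract concrete, here is a minimal toy sketch of a three-stage pipeline (Explainer, Identifier, Moderator, as named in the Figure 3 caption). Everything in it is an illustrative assumption rather than the authors' implementation: `toy_classifier`, the word weights, and the lexicon are hypothetical; the explainer uses simple leave-one-out occlusion as a stand-in for Integrated Gradients or SHAP; and the moderator uses plain word removal, which is only one plausible mitigation strategy.

```python
from collections import defaultdict
from typing import Callable, List, Set, Tuple

Classifier = Callable[[str], float]  # returns a toxicity probability for a text


def toy_classifier(text: str) -> float:
    """Hypothetical stand-in classifier: sums fixed per-word weights."""
    weights = {"stupid": 0.6, "idiot": 0.7, "gay": 0.5}
    score = sum(weights.get(w, 0.0) for w in text.lower().split())
    return min(1.0, score)


def explain(clf: Classifier, corpus: List[str], top_k: int) -> List[str]:
    """Explainer stand-in: rank words by how much removing them lowers the score.

    This is leave-one-out occlusion, NOT Integrated Gradients or SHAP.
    """
    importance = defaultdict(float)
    for text in corpus:
        words = text.lower().split()
        base = clf(text)
        for i, word in enumerate(words):
            reduced = " ".join(words[:i] + words[i + 1:])
            importance[word] += base - clf(reduced)
    ranked = sorted(importance, key=lambda w: (-importance[w], w))
    return [w for w in ranked if importance[w] > 0][:top_k]


def identify(words: List[str], protected_lexicon: Set[str]) -> Set[str]:
    """Identifier stand-in: a dictionary lookup (assumed, hypothetical lexicon)."""
    return {w for w in words if w in protected_lexicon}


def moderate(train: List[Tuple[str, int]], protected: Set[str]) -> List[Tuple[str, int]]:
    """Moderator stand-in: drop protected words from each training text."""
    return [
        (" ".join(w for w in text.split() if w.lower() not in protected), label)
        for text, label in train
    ]


unlabeled = ["you are a stupid idiot", "he is gay", "nice day"]
top_words = explain(toy_classifier, unlabeled, top_k=3)
protected = identify(top_words, protected_lexicon={"gay", "muslim"})
mitigated = moderate([("he is gay and kind", 0), ("you idiot", 1)], protected)
```

In this sketch the mitigated training set would then be used to re-train the classifier; that step is omitted because the toy classifier has no training procedure.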

Paper Structure

This paper contains 42 sections, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Toxicity probabilities P(T) assigned to four sentences that a classifier predicts as toxic. The first three sentences are misclassified, while the last is correctly identified.
  • Figure 2: Words impacting the toxicity classification of the four sentences in Figure 1. The more intense the red (blue) color of a word, the more that word contributes to the toxic (non-toxic) classification.
  • Figure 3: Our Mitigation Framework. It takes the original NLP classifier, the original training dataset, and a new unlabeled corpus as input. The framework consists of three components: A) an Explainer that identifies the most important words used by the classifier for predictions on the unlabeled corpus; B) an Identifier that determines which of those words are protected attributes; and C) a Moderator that generates a mitigated training dataset used to re-train the classifier so as to reduce its reliance on the previously identified protected attributes.
  • Figure 4: Explainer component evaluation. Decrease in F1 score after removing from the test set the most important words extracted by the Explainer component with the Integrated Gradients (IG) and SHAP techniques. A greater decrease indicates higher precision in identifying the most important words for predictions.
  • Figure 5: Identifier component evaluation. Cohen’s kappa agreement between annotators in labeling protected attributes for the 400 most toxic words. The annotation was performed by two expert annotators (A1 and A2), ChatGPT (GPT), MTurk (MT), and a pre-defined dictionary (D). Pairwise Cohen’s kappa is reported, in the range [0, 1], where a higher score indicates a higher level of agreement.
  • ...and 7 more figures
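
The pairwise agreement reported in Figure 5 can be illustrated with a short sketch of Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The annotator names follow the caption, but the label vectors below are made-up toy data, not the paper's annotations. Note that kappa can in general be negative when agreement is worse than chance.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cohens_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def pairwise_kappa(labels: Dict[str, List[int]]) -> Dict[Tuple[str, str], float]:
    """Kappa for every unordered pair of annotators."""
    return {
        (x, y): cohens_kappa(labels[x], labels[y])
        for x, y in combinations(labels, 2)
    }


# Toy data (hypothetical): 1 = word labeled as a protected attribute, 0 = not.
toy_labels = {
    "A1": [1, 1, 0, 0],
    "A2": [1, 0, 0, 0],
    "D": [1, 1, 0, 0],
}
agreement = pairwise_kappa(toy_labels)
```

With more annotators (for example GPT and MT from the caption), the same function yields the full agreement matrix, one value per pair.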