Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble

Olivia Sturman, Aparna Joshi, Bhaktipriya Radharapu, Piyush Kumar, Renee Shelby

TL;DR

A lightweight post-processing method for improving counterfactual fairness in closed-source text safety classifiers. It builds an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer.

Abstract

The increasing use of large language models (LLMs) demands performant guardrails to ensure the safety of LLM inputs and outputs. When these safeguards are trained on imbalanced data, they can learn societal biases. We present a lightweight post-processing method for improving counterfactual fairness in closed-source text safety classifiers. Our approach builds an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded OpenAI dataset and a new templated, LLM-generated dataset based on user prompts; both are counterfactually balanced across identity groups and cover four key areas of safety. We will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
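The ensemble idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which model family the ensemble uses, so the weighted logistic regression below is an assumption, as are the feature columns (`toxicity`, `identity_attack`), the toy scores, and the optional per-example `weights` argument, which stands in for a reweighted training set. The point is only that a small model can be trained on the scores emitted by closed-source classifiers without access to their internals.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_ensemble(features, labels, weights=None, lr=0.5, epochs=2000):
    """Fit a weighted logistic regression on source-classifier scores.

    features: list of rows, one score per source-model attribute (assumed layout)
    labels:   1 = violates policy, 0 = safe
    weights:  optional per-example weights (hypothetical stand-in for reweighting)
    """
    n, d = len(features), len(features[0])
    weights = weights or [1.0] * n
    total = sum(weights)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y, s in zip(features, labels, weights):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += s * err * x[j]
            grad_b += s * err
        # Weighted-average gradient step.
        w = [wi - lr * g / total for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b

def predict(model, x):
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data: columns are hypothetical source scores [toxicity, identity_attack].
# identity_attack fires on safe identity mentions, so it is not predictive here.
X = [[0.9, 0.8], [0.8, 0.1], [0.1, 0.7], [0.2, 0.1], [0.85, 0.6], [0.15, 0.6]]
y = [1, 1, 0, 0, 1, 0]
model = train_ensemble(X, y)
```

In this toy setting the fitted model leans on the predictive column and largely ignores the one that reacts to identity terms alone, which is the sense in which an ensemble can reduce reliance on a biased source attribute.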

Paper Structure (17 sections, 5 figures, 8 tables, 1 algorithm)

This paper contains 17 sections, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our debiasing approach: the ensemble is a small model whose input features are the output attributes of the source models; it is trained on a small dataset to output policy-aligned predictions.
  • Figure 2: An illustration of our two-stage debiasing approach. We use a combination of the OpenAI dataset and our LLM-generated dataset as train, test, and validation sets. We provide the SA metrics of our baseline on the held-out validation set as input to FDW, which outputs a reweighted dataset used to train a counterfactually fairer model. We introduce four hyper-parameters per harm ($\lambda_{\text{Safe}}$, $\lambda_{\text{Unsafe}}$, $\beta_{\text{Safe}}$, $\beta_{\text{Unsafe}}$) that tune the per-slice data re-sampling to balance model fairness against performance.
  • Figure 3: SA for the Hate (or equivalent) source model attributes $identity\_attack$ and $sensitive\_topic$ (in blue text), and for our baseline and debiased ensembles, for the group Race on Safe examples. Cell values reflect the average classification score, from green (low) to blue (high). A uniformly colored row indicates less bias.
  • Figure 4: Depiction of the reduced feature contribution percentage of the biased source model attributes $identity\_attack$ and $sensitive\_topic$ in the debiased model compared to the baseline for Hate. Attributes with less than 5% feature contribution are excluded from the diagram.
  • Figure 5: On the y-axis, we plot, for each attribute, the average across identity categories of the max gap between SAs for Hate. The top 3 features of the baseline model are depicted in red, and those of the remediated model in blue. A lower position on the y-axis indicates less bias for that attribute. We also plot, as stars, the max gaps between SAs for the models' overall Hate scores, illustrating how biased attributes can significantly affect a model's overall bias, particularly when the model relies heavily on them.
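
The quantities in Figures 3 and 5 can be sketched as follows. This section does not define SA formally, so the code assumes what the captions suggest: an SA is the average classification score within one identity slice, and the Figure 5 value is the average, across identity categories, of the largest gap between SAs inside each category. The record fields (`category`, `subgroup`, `score`), the subgroup names, and the scores are hypothetical placeholders.

```python
from collections import defaultdict

def sliced_averages(records):
    """Mean classifier score per (identity category, subgroup) slice."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        key = (r["category"], r["subgroup"])
        sums[key][0] += r["score"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

def average_max_gap(sa):
    """Average over categories of (max SA - min SA) within each category."""
    by_category = defaultdict(list)
    for (category, _subgroup), avg in sa.items():
        by_category[category].append(avg)
    gaps = [max(vals) - min(vals) for vals in by_category.values()]
    return sum(gaps) / len(gaps)

# Toy scores on Safe examples; a fair model would score every slice alike.
records = [
    {"category": "Race", "subgroup": "subgroup_a", "score": 0.25},
    {"category": "Race", "subgroup": "subgroup_a", "score": 0.75},
    {"category": "Race", "subgroup": "subgroup_b", "score": 0.25},
    {"category": "Religion", "subgroup": "subgroup_c", "score": 0.5},
    {"category": "Religion", "subgroup": "subgroup_d", "score": 0.125},
]
sa = sliced_averages(records)
gap = average_max_gap(sa)
```

Because both functions operate on raw scores rather than thresholded decisions, the result does not depend on any particular classification threshold, which matches the threshold-agnostic framing in the abstract.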