Inducing Group Fairness in Prompt-Based Language Model Decisions

James Atwood, Nino Scherrer, Preethi Lahoti, Ananth Balashankar, Flavien Prost, Ahmad Beirami

TL;DR

The paper investigates equal opportunity fairness in two LM-based classification paradigms, prompt-based and embedding-based, and finds significant group disparities in false positive rates across religious and other demographic groups. It adapts three remediation families (prompting, in-processing with a Min Diff/MMD regularizer, and post-processing with an emfairening head) and evaluates them on the Civil Comments Identity dataset. Embedding-based classifiers generally outperform prompt-based ones, with in-processing providing the strongest fairness-performance tradeoffs; prompting offers only limited improvements. Post-processing transfers to unseen models, highlighting the potential for universal fairness heads, while prompting methods are less controllable. Overall, the work emphasizes the need for LM-structure-aware remediation and provides practical guidance on when to apply in-processing versus post-processing for fair LM-based decision making.
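As a rough illustration of the in-processing family named above, the following is a minimal sketch, not the paper's implementation, of a MinDiff-style training objective: a standard classification loss plus an MMD penalty between the score distributions of two groups, weighted by a coefficient lambda. The RBF kernel, the binary group encoding, and the restriction to negative examples (which targets false positive rate gaps, i.e. equal opportunity) are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def rbf_kernel(x, y, sigma=1.0):
        # Gaussian kernel matrix between two 1-D score vectors.
        diff = x.unsqueeze(1) - y.unsqueeze(0)
        return torch.exp(-(diff ** 2) / (2 * sigma ** 2))

    def mmd(a, b, sigma=1.0):
        # Squared maximum mean discrepancy between two score samples.
        return (rbf_kernel(a, a, sigma).mean()
                - 2 * rbf_kernel(a, b, sigma).mean()
                + rbf_kernel(b, b, sigma).mean())

    def in_processing_loss(logits, labels, group_ids, lam=1.0):
        # Standard task loss on all examples.
        task = F.binary_cross_entropy_with_logits(logits, labels.float())
        # MinDiff-style penalty: pull the score distributions of negative
        # (non-toxic) examples from the two groups together, which
        # targets the false positive rate gap between groups.
        scores = torch.sigmoid(logits)
        neg = labels == 0
        a = scores[neg & (group_ids == 0)]
        b = scores[neg & (group_ids == 1)]
        penalty = mmd(a, b) if len(a) > 1 and len(b) > 1 else scores.sum() * 0.0
        return task + lam * penalty

Sweeping lam in a loss of this shape is what traces out fairness-performance curves like the Pareto frontiers shown in Figure 2.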

Abstract

Classifiers are used throughout industry to enforce policies, ranging from the detection of toxic content to age-appropriate content filtering. While these classifiers serve important functions, it is also essential that they are built in ways that minimize unfair biases for users. One such fairness consideration is called group fairness, which requires that different sub-populations of users receive equal treatment. This is a well-studied problem in the context of 'classical' classifiers. However, the emergence of prompt-based language model (LM) decision making has created new opportunities to solve text-based classification tasks, and the fairness properties of these new classifiers are not yet well understood. Further, the 'remediation toolkit' is incomplete for LM-based decision makers, and little is understood about how to improve their group fairness while maintaining classifier performance. This work sets out to add more tools to that toolbox. We introduce adaptations of existing effective approaches from classical classifier fairness to the prompt-based classifier space. We also devise simple methods that take advantage of the new structure of prompt-based decision makers and operate at the prompt level. We compare these approaches empirically on real data. Our results suggest that adaptations of approaches that are effective for classical classifiers remain effective in the LM-based classifier environment. However, there is room for further exploration of prompt-based remediation methods (and other remediation methods that take advantage of LM structure).

Paper Structure

This paper contains 25 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Classification flow diagrams for prompt-based and embedding-based classifiers. Decisions are encouraged via 'text wrappers' that nudge the LM to make a classification decision. For prompt-based classifiers, we treat the wrapped text as a prefix and query the LM for the scores of two postfix tokens (such as 'Yes' or 'No') that represent positive and negative decisions. We apply a softmax to these scores to obtain a probability distribution over the classification result and use this for decision making. For embedding-based classifiers, we assume that the LM is 'introspective' and can supply its activations. We instead query the LM for the activations of its last layer to serve as an embedding. We collect those embeddings into a design matrix, then fit a logistic regression model on that matrix and the corresponding labels. The logistic regression model is then used for downstream decision making. (A minimal code sketch of both constructions follows this list.)
  • Figure 2: Pareto frontiers of different remediation techniques. The left plot shows the performance and fairness of prompt-based classifiers, and the middle plot shows those of embedding-based classifiers. The unremediated classifier setting is denoted by a '+' and prompting-based remediation methods are denoted by single symbols. Note that the in-processing baseline is inapplicable to prompt-based classifiers. Each point for in-processing and post-processing is generated by setting different values of $\lambda$ in the in-processing and post-processing loss equations in the appendix. The dashed and solid lines give the Pareto frontier, where performance can only be gained by sacrificing fairness, for post-processing and in-processing, respectively. The right plot gives the effect of model transfer: we fit a post-processing remediation model to the PaLM 2 S model, then compare the effects of applying it to PaLM 2 S (native) versus the larger PaLM 2 L model (transfer). The lines give the Pareto frontier (solid for native, dashed for transfer).
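To make the two constructions in Figure 1 concrete, here is a minimal sketch under stated assumptions: a hypothetical lm object exposing a continuation log-score (lm.score) and a last-layer embedding (lm.embed), plus an illustrative wrapper prompt and decision threshold. None of these names correspond to a specific model API or to the paper's code.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def prompt_based_decision(lm, text):
        # Wrap the input so the LM is nudged toward a yes/no decision.
        prompt = f"Is the following comment toxic?\n{text}\nAnswer:"
        # Score the two candidate postfix tokens (assumed lm.score API)
        # and softmax them into a probability of a positive decision.
        yes = lm.score(prompt, " Yes")
        no = lm.score(prompt, " No")
        p_yes = np.exp(yes) / (np.exp(yes) + np.exp(no))
        return p_yes >= 0.5, p_yes

    def embedding_based_classifier(lm, texts, labels):
        # Collect last-layer activations into a design matrix
        # (assumed lm.embed API) and fit a logistic regression head
        # that is used for downstream decision making.
        X = np.stack([lm.embed(t) for t in texts])
        return LogisticRegression(max_iter=1000).fit(X, labels)

The post-processing remediation compared in Figure 2 operates on top of classifier outputs like these; its exact objective is given by the post-processing loss equation in the paper's appendix.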