Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes

Guangliang Liu, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Marie Johnson

TL;DR

The paper investigates why gender stereotype mitigation via moral alignment degrades downstream performance in pretrained language models. It treats forgetting as the central mechanism and uses Counterfactual Data Augmentation (CDA) to enforce anti-stereotypical associations while measuring the effect on overall forgetting. Three findings stand out: downstream performance correlates with overall forgetting; selective forgetting reduces stereotypes but does not lessen overall forgetting; and common forgetting-mitigation strategies fail to improve the trade-off. The work highlights the need for pragmatics-aware approaches and fusion strategies to balance fairness and utility in moral alignment tasks.

Abstract

Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning on curated datasets. Gender stereotype mitigation is a representative task within the broader application of moral alignment. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget only stereotypical knowledge through carefully designed fairness objectives, while preserving their language modeling capability (that is, limiting overall forgetting). In this short paper, we investigate whether the performance trade-off can be achieved through the lens of forgetting and the fairness objective. Our analysis shows that the large datasets needed for satisfactory fairness highlight the limitations of current fairness objectives in achieving an effective trade-off: (1) downstream task performance is strongly correlated with overall forgetting; (2) selective forgetting reduces stereotypes, but overall forgetting increases; and (3) general solutions for alleviating forgetting are ineffective at reducing overall forgetting and fail to improve downstream task performance.

Paper Structure

This paper contains 13 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Fairness Objective Motivation. Applying CDA creates the desired association between occupation and gender, resulting in an anti-stereotypical corpus. However, such fine-tuning also introduces an undesired association between gender and a neutral phrase.
  • Figure 2: StereoSet Score (Left), Overall Forgetting (Middle) and SST Performance (Right) over Fine-tuning Epochs of BERT. Mechanistic analysis reveals the effects of forgetting and the fairness objective on gender stereotype mitigation and on downstream SST performance. The StereoSet score is driven by both forgetting and the fairness objective, though forgetting alone can contribute to satisfactory fairness. The SST performance is governed by forgetting. Additional results for Llama are in the appendix.
  • Figure 3: Experimental Results with Varying Fine-tuning Dataset Sizes for BERT. We take 10K samples from $\mathcal{D}_f$ and consider different sizes of $\mathcal{D}_n$, following Webster et al. (2020). Left: StereoSet score; Middle: overall forgetting; Right: SST performance. Increasing the amount of fine-tuning data leads to greater forgetting and poorer SST performance. Additional results for Llama are in the appendix.
  • Figure 4: Experimental Results with KL-based Regularization for BERT. Left: StereoSet score; Middle: overall forgetting; Right: SST performance. The different levels of regularization lead to similar overall forgetting and SST performance, although they cause slight differences in the StereoSet score. Additional results for Llama are in the appendix.
  • Figure 5: Gender Information in Neutral Phrases Acquired from the Debiased BERT Model.
  • ...and 3 more figures
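
The CDA step referenced in Figure 1 can be illustrated with a minimal sketch. This is not the authors' pipeline: the word list `GENDER_PAIRS` and the helpers `counterfactual` and `augment` are hypothetical names introduced here, and the list is deliberately tiny and restricted to unambiguous pairs (words such as "her", which can map to "him" or "his", are left out). Whether the paper keeps the original sentences alongside the swapped ones is not stated in this summary, so `augment` makes that an explicit assumption.

```python
import re

# Hypothetical, deliberately small list of unambiguous gendered word pairs.
# Real CDA setups rely on much larger curated lists.
GENDER_PAIRS = [
    ("he", "she"),
    ("himself", "herself"),
    ("man", "woman"),
    ("men", "women"),
    ("father", "mother"),
    ("son", "daughter"),
]

# Bidirectional lookup: each word maps to its counterpart.
SWAP = {}
for masculine, feminine in GENDER_PAIRS:
    SWAP[masculine] = feminine
    SWAP[feminine] = masculine

# Match whole words only, case-insensitively.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in SWAP) + r")\b",
    re.IGNORECASE,
)


def counterfactual(sentence):
    """Return the sentence with every listed gendered word swapped."""
    def swap(match):
        word = match.group(0)
        replacement = SWAP[word.lower()]
        # Preserve an initial capital letter.
        return replacement.capitalize() if word[0].isupper() else replacement
    return PATTERN.sub(swap, sentence)


def augment(corpus):
    """Assumption: keep each original and append its counterfactual if it differs."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented
```

For instance, `counterfactual("The nurse said she was tired.")` pairs the occupation with the opposite pronoun, which is the kind of anti-stereotypical association the caption describes. A neutral phrase in the same sentence ("was tired") is carried along unchanged, which hints at how the undesired gender association with neutral phrases could arise during fine-tuning.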