Diagnosing the Performance Trade-off in Moral Alignment: A Case Study on Gender Stereotypes
Guangliang Liu, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Marie Johnson
TL;DR
The paper investigates why mitigating gender stereotypes through moral alignment degrades downstream performance in pretrained language models. It treats forgetting as the central mechanism and uses Counterfactual Data Augmentation (CDA) to enforce anti-stereotypical associations while measuring the effect on overall forgetting. Three findings emerge: downstream performance correlates with overall forgetting; selective forgetting reduces stereotypes but does not lessen overall forgetting; and common forgetting-mitigation strategies fail to improve the trade-off. The work highlights the need for pragmatics-aware approaches and fusion strategies to balance fairness and utility in moral alignment tasks.
Abstract
Moral alignment has emerged as a widely adopted approach for regulating the behavior of pretrained language models (PLMs), typically through fine-tuning on curated datasets. Gender stereotype mitigation is a representative task within the broader application of moral alignment. However, this process often comes at the cost of degraded downstream task performance. Prior studies commonly aim to achieve a performance trade-off by encouraging PLMs to selectively forget only stereotypical knowledge through carefully designed fairness objectives, while preserving their language modeling capability, that is, while avoiding overall forgetting. In this short paper, we investigate, through the lens of forgetting and the fairness objective, whether this performance trade-off can be achieved. Our analysis shows that the large datasets required for satisfactory fairness expose the limitations of current fairness objectives in achieving an effective trade-off: (1) downstream task performance is strongly correlated with overall forgetting; (2) selective forgetting reduces stereotypes, but overall forgetting increases; and (3) general solutions for alleviating forgetting are ineffective at reducing overall forgetting and fail to improve downstream task performance.
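For readers unfamiliar with Counterfactual Data Augmentation, mentioned in the TL;DR, the basic idea can be sketched as follows: each training sentence is paired with a copy in which gendered words are swapped. This is a minimal illustration, not the authors' pipeline. The word-pair list `GENDER_PAIRS` and the helper names `counterfactual` and `augment` are hypothetical, and ambiguous pronouns such as "her" (which maps to either "his" or "him") are deliberately left out.

```python
import re

# Hypothetical, deliberately tiny word-pair list (an assumption for illustration);
# lexicons used in practice are much larger and handle ambiguous pronouns.
GENDER_PAIRS = [("he", "she"), ("man", "woman"), ("father", "mother"), ("boy", "girl")]

# Build a bidirectional swap table from the pairs.
SWAP = {}
for a, b in GENDER_PAIRS:
    SWAP[a] = b
    SWAP[b] = a

# Match any listed word as a whole token, ignoring case.
_TOKEN = re.compile(r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE)


def counterfactual(sentence: str) -> str:
    """Return the sentence with every listed gendered word swapped."""
    def swap(match):
        word = match.group(0)
        repl = SWAP[word.lower()]
        # Preserve an initial capital letter.
        return repl.capitalize() if word[0].isupper() else repl
    return _TOKEN.sub(swap, sentence)


def augment(corpus):
    """Keep each original sentence and add its counterfactual when it differs."""
    out = []
    for sentence in corpus:
        out.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            out.append(swapped)
    return out
```

For example, `augment(["The boy ran.", "It rained."])` keeps both originals and adds "The girl ran."; the sentence without gendered words is left without a counterfactual.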
