Locating and Mitigating Gender Bias in Large Language Models

Yuchen Cai, Ding Cao, Rongxi Guo, Yaqin Wen, Guiquan Liu, Enhong Chen

TL;DR

This paper proposes LSDM (Least Square Debias Method), a knowledge-editing-based method for mitigating gender bias in occupational pronouns, and compares it against two baselines on three gender bias datasets and seven knowledge competency test datasets. The results indicate that the primary contributors to gender bias are the bottom MLP modules acting on the last token of occupational pronouns.

Abstract

Large language models (LLMs) are pre-trained on extensive corpora to learn facts and human cognition, which contain human preferences. However, this process can inadvertently lead these models to acquire biases and stereotypes prevalent in society. Prior research has typically tackled bias from a one-dimensional perspective, concentrating on either locating or mitigating it. This limited perspective has made it harder for the two lines of bias research to complement and build upon one another. In this study, we integrate the processes of locating and mitigating bias within a unified framework. First, we use causal mediation analysis to trace the causal effects of the activations of different components within a large language model. Building on this, we propose LSDM (Least Square Debias Method), a knowledge-editing-based method for mitigating gender bias in occupational pronouns, and compare it against two baselines on three gender bias datasets and seven knowledge competency test datasets. The experimental results indicate that the primary contributors to gender bias are the bottom MLP modules acting on the last token of occupational pronouns and the top attention module acting on the final word in the sentence. Furthermore, LSDM mitigates gender bias in the model more effectively than the other baselines, while fully preserving the model's capabilities in all other aspects.
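The causal tracing step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy network, its weights, the uniform causal averaging matrix `mix` (a crude stand-in for attention), and the scalar `readout` score (standing in for a gendered-pronoun probability) are all hypothetical. The sketch only shows the clean, corrupted, and restored runs, with the indirect effect taken as the restored score minus the corrupted score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, d = 4, 3, 8

# Hypothetical toy model (an assumption, not GPT2-XL or GPT-J-6B).
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(n_layers)]
readout = rng.normal(size=d)
# Causal uniform averaging: token t only sees tokens 0..t.
mix = np.tril(np.ones((n_tokens, n_tokens)))
mix /= mix.sum(axis=1, keepdims=True)

def run(emb, restore=None):
    """Forward pass. `restore` maps (layer, token) -> hidden state to patch in."""
    restore = restore or {}
    h = emb.copy()
    states = []
    for layer, w in enumerate(weights):
        h = h + np.tanh((mix @ h) @ w)
        for (patch_layer, tok), value in restore.items():
            if patch_layer == layer:
                h[tok] = value
        states.append(h.copy())
    # The score is read from the last token of the last layer.
    return float(h[-1] @ readout), states

emb = rng.normal(size=(n_tokens, d))

# Run 1: clean run, caching every hidden state.
clean_score, clean_states = run(emb)

# Run 2: corrupted run, with noise added to the embedding of token 0.
emb_corrupt = emb.copy()
emb_corrupt[0] += rng.normal(scale=3.0, size=d)
corrupt_score, _ = run(emb_corrupt)

# Run 3: corrupted run with one clean hidden state restored at a time.
effects = np.zeros((n_layers, n_tokens))
for layer in range(n_layers):
    for tok in range(n_tokens):
        patch = {(layer, tok): clean_states[layer][tok]}
        restored_score, _ = run(emb_corrupt, patch)
        effects[layer, tok] = restored_score - corrupt_score
```

Each entry of `effects` is the indirect effect of one (layer, token) state for a single input; averaging such values over many prompts would give a map comparable in spirit to a causal trace heatmap.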

Paper Structure (16 sections, 19 equations, 4 figures, 5 tables, 1 algorithm)


Figures (4)

  • Figure 1: Causal Trace computes the causal effect of neuron activations by running the network three times: (1) once normally, (2) once with the embedding corrupted, and (3) once with the embedding corrupted and selected internal activations restored to their clean values. (a) The left side depicts the clean run. (b) The right side depicts the restoration run, which measures the impact of $h_1^1$ on the generation of gender bias by corrupting the embedding and restoring $h_1^1$ in the first layer.
  • Figure 2: The upper and lower parts show the causal trace results for GPT2-XL and GPT-J-6B, respectively. The horizontal axis represents the layers of the model, and the vertical axis represents the tokens within a sentence, labeled by their roles. The color indicates the magnitude of the corresponding average indirect effect (AIE). Because altering a single MLP or attention module has minimal influence on the model, for these modules we adjust the MLP (attention) values across a window of 10 layers at once, centered on layer $l$ and including the 5 layers above and 4 layers below it, following Meng et al. (2022).
  • Figure 3: To isolate the effects of MLP (attention) modules when measuring causal effects, the computation graph is modified.
  • Figure 5: LSDM applied to a single $W_{proj}$: the parameters of $W_{proj}$ are modified by finding the vector $k$ and the corresponding unbiased vector $v^*$.
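As a rough illustration of the kind of edit the Figure 5 caption describes, the sketch below applies a generic minimum-norm least-squares update to a projection matrix so that a given vector $k$ maps to a target vector $v^*$. This is an assumption-laden stand-in, not the LSDM objective itself, which may include additional terms (for example, to preserve other associations) that are not reproduced here. The function name `rank_one_edit` and the random matrices are hypothetical.

```python
import numpy as np

def rank_one_edit(w_proj, k, v_star):
    """Return W' = W + delta, where delta has the smallest Frobenius norm
    among all updates satisfying W' @ k == v_star (a generic least-squares
    edit, used here as an illustrative assumption)."""
    residual = v_star - w_proj @ k           # what the current map gets wrong on k
    delta = np.outer(residual, k) / (k @ k)  # rank-one correction along k
    return w_proj + delta

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
w_proj = rng.normal(size=(d_out, d_in))  # hypothetical projection weights
k = rng.normal(size=d_in)                # hypothetical input vector for the occupation
v_star = rng.normal(size=d_out)          # hypothetical unbiased target output

w_new = rank_one_edit(w_proj, k, v_star)

# Any input orthogonal to k is mapped exactly as before the edit.
q = rng.normal(size=d_in)
q -= (q @ k) / (k @ k) * k
```

Because the correction is rank-one along $k$, the edited matrix sends $k$ to $v^*$ while leaving its action on directions orthogonal to $k$ unchanged, which is one simple way to keep an edit localized.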