Debiasing Algorithm through Model Adaptation

Tomasz Limisiewicz; David Mareček; Tomáš Musil

Abstract

Large language models are becoming the go-to solution for an ever-growing number of tasks. However, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. This work proposes a novel method for detecting and mitigating gender bias in language models. We perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey bias. Based on the analysis results, we intervene in the model by applying a linear projection to the weight matrices of these layers. Our titular method, DAMA, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. We release code for our method and models, which retain LLaMA's state-of-the-art performance while being significantly less biased.

Paper Structure (48 sections, 2 theorems, 15 equations, 12 figures, 4 tables)

Introduction
Methodology and Experimental Setup
LLaMA Models
Gender Bias Evaluation in Language Generation
Other Gender Bias Indicators
WinoBias
StereoSet
Language Modeling
Downstream Tasks
Bias Evaluation and Causal Tracing
Experiments
Bias Evaluation
Causal Tracing
Results
Bias Evaluation
...and 33 more sections

Key Result

Theorem 1

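Assume that we have a linear subspace $\mathcal{C} \subseteq \mathbb{R}^{o}$. Given an $n$-element key matrix $U \in \mathbb{R}^{i \times n}$ and a value matrix $V \in \mathbb{R}^{o \times n}$, we search for a mapping matrix $W \in \mathbb{R}^{o \times i}$ minimizing the least squares and satisfying $\forall_{i=1}^{n}\; W u_i \perp \mathcal{C}$. Specifically, we solve:

$$\hat{W} = \arg\min_{W} \|WU - V\|_F^2 \quad \text{such that} \quad \forall_{i=1}^{n}\; W u_i \perp \mathcal{C}$$

This equation is solved by:

$$\hat{W} = (\mathbb{I} - P_c)\, V U^T (U U^T)^{-1}$$

where $P_c$ is a projection matrix onto the subspace $\mathcal{C}$.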
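The constrained least-squares problem in Theorem 1 can be checked numerically. The following is a minimal NumPy sketch, not the authors' released code. It assumes the solution takes the form $(\mathbb{I} - P_c)\, V U^T (U U^T)^{-1}$, that is, the ordinary least-squares mapping followed by a projection that removes the component lying in $\mathcal{C}$. The helper names `projection_onto` and `constrained_least_squares`, the random data, and the dimensions are all hypothetical choices made for illustration.

```python
import numpy as np


def projection_onto(basis: np.ndarray) -> np.ndarray:
    """Orthogonal projection matrix onto the column span of `basis` (shape o x k).

    Assumes `basis` has full column rank.
    """
    q, _ = np.linalg.qr(basis)  # reduced QR: columns of q are an orthonormal basis
    return q @ q.T


def constrained_least_squares(U: np.ndarray, V: np.ndarray, P_c: np.ndarray) -> np.ndarray:
    """Return (I - P_c) V U^T (U U^T)^{-1} for keys U (i x n) and values V (o x n).

    Assumes U has full row rank so that U U^T is invertible.
    """
    o = V.shape[0]
    # Ordinary least squares: solve (U U^T) X = U V^T, then transpose to get V U^T (U U^T)^{-1}.
    W_ols = np.linalg.solve(U @ U.T, U @ V.T).T
    # Remove the part of every output that lies in the subspace C.
    return (np.eye(o) - P_c) @ W_ols


# Illustrative (hypothetical) sizes: input dim 4, output dim 6, 50 key/value pairs, dim(C) = 2.
rng = np.random.default_rng(0)
i_dim, o_dim, n, k = 4, 6, 50, 2
U = rng.normal(size=(i_dim, n))
V = rng.normal(size=(o_dim, n))
C_basis = rng.normal(size=(o_dim, k))

P_c = projection_onto(C_basis)
W_hat = constrained_least_squares(U, V, P_c)

# Largest absolute inner product between a basis vector of C and a mapped key W_hat u_j.
# It should be zero up to floating-point error if the constraint holds.
residual_in_C = np.abs(C_basis.T @ W_hat @ U).max()
print(f"max |<c, W u_j>| = {residual_in_C:.2e}")
```

Because the projection is multiplied into the mapping matrix, `W_hat` has the same shape as the unconstrained solution, which matches the statement in the Figure 1 caption that the intervention does not change the parameter count.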

Figures (12)

Figure 1: Schema (b) shows DAMA intervention in a LLaMA layer. Even though $\mathbb{I} - P_c$ is depicted as a separate module, in practice, it is multiplied with the output matrix of a feed-forward layer ($W_{FF}$). Therefore, DAMA is neutral to the model's parameter count and architecture. (a) We show the behavior of the model when presented with a stereotypical prompt. Specifically, (c) shows the projections of the feed-forward latent vector ($\vec{u}$) onto the output space. With DAMA (lower arrow), we nullify the gender component of the representation. It results in balanced probabilities of gendered tokens in the model's output, as shown in (d).
Figure 2: Causal tracing of factual $a_f$, stereotypical $a_s$ coefficients and intercept $b$ in regression to indirect effects of the model $y_{IE}$. The linear models are independently fitted for restored MLP clean representation at each layer and token position.
Figure 3: The effect of applying DAMA to the LLaMA 7B model on performance and bias in language modeling. We measured bias on gendered prompts (Section \ref{sec:bias-eval-lm}) by the linear coefficients $a_s$ and $b$; causal language modeling capabilities are measured by perplexity. Stars mark the performance of the model picked for further evaluation. The dashed line corresponds to the scores of the original LLaMA 7B model.
Figure 4: LLaMA 7B. Gender factual and stereotypical coefficients for linear regression to indirect effects of the model $y_{IE}$. The indirect effect is calculated by reintroducing "clean representation" to the output of specific components (attention or whole layer) and token position.
Figure 5: LLaMA 13B
...and 7 more figures

Theorems & Definitions (3)

Theorem 1
Theorem 2: Ordinary Least Square Problem
proof
