Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models

Rishabh Adiga, Besmira Nushi, Varun Chandrasekaran

TL;DR

This work proposes $\texttt{ATLAS}$ (Attention-based Targeted Layer Analysis and Scaling), a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers.

Abstract

We explore the internal mechanisms of how bias emerges in large language models (LLMs) when provided with ambiguous comparative prompts: inputs that compare or enforce choosing between two or more entities without providing clear context for preference. Most approaches for bias mitigation focus on either post-hoc analysis or data augmentation. However, these are transient solutions that do not address the root cause: the model itself. Numerous prior works show the influence of the attention module in steering generations. We believe that analyzing attention is also crucial for understanding bias, as it provides insight into how the LLM distributes its focus across different entities and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the LLM's preference for one entity over another. We then propose $\texttt{ATLAS}$ (Attention-based Targeted Layer Analysis and Scaling), a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers. To evaluate our method, we conduct experiments across 3 datasets (BBQ, CrowS-Pairs, and WinoGender) using $\texttt{GPT-2 XL}$ (1.5B), $\texttt{GPT-J}$ (6B), $\texttt{LLaMA-2}$ (7B) and $\texttt{LLaMA-3}$ (8B). Our experiments demonstrate that bias is concentrated in the later layers, typically around the last third. We also show that $\texttt{ATLAS}$ effectively mitigates bias through targeted interventions without compromising downstream performance, with an average increase of only 0.82% in perplexity when the intervention is applied. We see an average improvement of 0.28 points in the bias score across all the datasets.
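The two steps named in the abstract, localizing bias to layers from attention scores and then scaling attention in those layers, can be illustrated with a minimal sketch on toy numbers. Everything below is an assumption made for illustration: the function names `localize_layers` and `scale_attention`, the use of the absolute attention gap between two entities as the ranking signal, the scaling factor `lam`, and the renormalization step are all hypothetical and are not taken from the paper's equations.

```python
from typing import List, Sequence, Set


def localize_layers(attn_a: Sequence[float], attn_b: Sequence[float], k: int = 1) -> List[int]:
    """Return the k layer indices where the last token attends most unevenly
    to two entities.

    attn_a[l] and attn_b[l] are assumed to be the attention mass from the last
    token to entity A and entity B at layer l (already aggregated over heads
    and over each entity's tokens). Ranking by the absolute gap is an
    illustrative heuristic, not the paper's localization equation.
    """
    gaps = [abs(a - b) for a, b in zip(attn_a, attn_b)]
    ranked = sorted(range(len(gaps)), key=lambda layer: gaps[layer], reverse=True)
    return ranked[:k]


def scale_attention(row: Sequence[float], favored: Set[int], lam: float) -> List[float]:
    """Scale attention on the favored entity's token positions by lam in [0, 1],
    then renormalize so the row sums to 1 (renormalization is an assumption).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    scaled = [w * lam if i in favored else w for i, w in enumerate(row)]
    total = sum(scaled)
    if total == 0.0:
        # Nothing left to normalize; fall back to the unmodified row.
        return list(row)
    return [w / total for w in scaled]


# Toy per-layer attention to two entities (4 layers).
attn_a = [0.02, 0.05, 0.30, 0.10]
attn_b = [0.02, 0.04, 0.10, 0.08]
top_layers = localize_layers(attn_a, attn_b, k=2)

# Toy last-token attention row at one selected layer; position 1 holds the
# entity the model currently favors.
new_row = scale_attention([0.1, 0.6, 0.3], favored={1}, lam=0.5)
```

In a real model the same idea would be applied inside the attention module of each selected layer, per prompt; this sketch only shows the arithmetic on a single attention row.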

Paper Structure

This paper contains 27 sections, 15 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Attention distribution at the last token across layers for entities (e.g. 'grandfather' vs. 'grandson' or 'fat' vs. 'slim') in prompts to reveal LLM biases. Most of the information about the entities is present around the last third of the LLM's layer depth, as indicated by the magnitude of attention scores in those layers. More details on this phenomenon for other models are given in the paper's appendix on attention distribution.
  • Figure 2: $\texttt{ATLAS}$ involves two main stages. Stage 1 identifies the most important layers that contribute towards biased outcomes. Stage 2 scales the attention weights at those layers in a strategic manner to mitigate bias. This approach is carried out for each prompt.
  • Figure 3: Localization is feasible. The paper's localization equation (label eq:approach1b) can help identify layers that contribute more to bias. We visualize the attention scores for all prompts in the age bias (left sub-figure) and nationality bias (right sub-figure) categories for GPT-J: notice that layers around layer 20 contribute the most (as indicated by the darker regions).
  • Figure 4: Scaling interventions successfully decrease bias. The interventions proposed in the paper's Intervention section, when applied to the top-$k$ most contributing layers (in comparison to other layers), result in the greatest bias ratio improvement (percentage decrease in bias ratio) across all bias categories considered in the BBQ dataset on GPT-J. This highlights the efficacy of the localization strategy detailed in the paper's Localization section.
  • Figure 5: Attention distribution at the last token across layers for entities
  • ...and 2 more figures
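
The Figure 4 caption reports improvement as the percentage decrease in bias ratio. A small sketch of that arithmetic follows. The helper names `bias_ratio` and `percent_decrease` are hypothetical, and defining the bias ratio as the larger candidate probability divided by the smaller one is an assumption made here for illustration; the paper's own metric definition is not reproduced in this summary.

```python
def bias_ratio(p_a: float, p_b: float) -> float:
    """Assumed bias ratio: larger candidate probability over the smaller one.

    A value of 1.0 means the model is indifferent between the two entities.
    This definition is an illustrative assumption, not the paper's metric.
    """
    low, high = min(p_a, p_b), max(p_a, p_b)
    if low <= 0.0:
        raise ValueError("both probabilities must be positive")
    return high / low


def percent_decrease(before: float, after: float) -> float:
    """Percentage decrease from `before` to `after`."""
    if before <= 0.0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Toy candidate probabilities before and after an intervention.
ratio_before = bias_ratio(0.8, 0.2)
ratio_after = bias_ratio(0.6, 0.4)
improvement = percent_decrease(ratio_before, ratio_after)
```

Under this assumed definition, a larger percentage decrease means the two candidate probabilities moved closer together after the intervention.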