
On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions

Dang Nguyen, Chenhao Tan

TL;DR

The paper introduces Admissions and Hiring, decision tasks with hypothetical applicant profiles in which a person's race can be inferred from their name, as simplified test beds for racial bias. It suggests that mechanistic approaches may provide a promising avenue for improving the fairness of LLMs.

Abstract

Understanding and mitigating biases is critical for the adoption of large language models (LLMs) in high-stakes decision-making. We introduce Admissions and Hiring, decision tasks with hypothetical applicant profiles where a person's race can be inferred from their name, as simplified test beds for racial bias. We show that Gemma 2B Instruct and LLaMA 3.2 3B Instruct exhibit strong biases. Gemma grants admission to 26% more White than Black applicants, and LLaMA hires 60% more Asian than White applicants. We demonstrate that these biases are resistant to prompt engineering: multiple prompting strategies all fail to promote fairness. In contrast, using distributed alignment search, we can identify "race subspaces" within model activations and intervene on them to debias model decisions. Averaging the representation across all races within the subspaces reduces Gemma's bias by 37-57%. Finally, we examine the generalizability of Gemma's race subspaces and find limited evidence for generalization: changing the prompt format can affect the race representation. Our work suggests mechanistic approaches may provide a promising avenue for improving the fairness of LLMs, but a universal race representation remains elusive.
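The averaging intervention mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the paper learns its subspaces with distributed alignment search, whereas here a random orthonormal `basis` stands in for a learned one, and the function name `average_in_subspace`, the array shapes, and the random activations are all hypothetical.

```python
import numpy as np

def average_in_subspace(acts: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Replace each activation's component inside a subspace with the batch mean.

    acts:  (batch, d_model) hidden activations (hypothetical shape).
    basis: (d_model, k) matrix whose orthonormal columns span the subspace.
    """
    coords = acts @ basis                             # (batch, k) subspace coordinates
    mean_coords = coords.mean(axis=0, keepdims=True)  # (1, k) batch average
    # Subtract each sample's own subspace component and add back the shared mean;
    # the part of the activation orthogonal to the subspace is left untouched.
    return acts + (mean_coords - coords) @ basis.T

# Stand-in data: a random orthonormal basis instead of a learned subspace.
rng = np.random.default_rng(0)
d_model, k, batch = 16, 3, 8
basis, _ = np.linalg.qr(rng.normal(size=(d_model, k)))
acts = rng.normal(size=(batch, d_model))
debiased = average_in_subspace(acts, basis)
```

After the call, every row of `debiased` has the same coordinates inside the subspace, so whatever information that subspace encodes no longer varies between samples in the batch.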

Paper Structure

This paper contains 23 sections, 3 equations, 15 figures, 16 tables.

Figures (15)

  • Figure 1: We consider two decision tasks: Admissions and Hiring, and examine two approaches to control model behavior. In prompt engineering, we attempt to debias by adding various instructions. To modify model internals, (1) we learn a "race subspace" to intervene so that the target decision becomes the counterfactual decision (i.e., as if the inferred race is the same as that from the source text). (2) We debias model decisions by averaging the representation in the race subspace across a batch of samples, which removes the variance in race representations between applicants.
  • Figure 2: Gemma and LLaMA show biases in Admissions and Hiring. Results are averaged over 5 trials each with 10,000 applicant profiles.
  • Figure 3: Alignment training test results. (a) IIAs across the alignment search window. There is strong race representation on the final token. (b) Subspace interchange intervention outperforms baselines at the best-IIA layers (10 for Admissions and 12 for Hiring).
  • Figure 4: Gemma Admissions prompt. The names are suggestive of race, sampled from the list below.
  • Figure 5: Gemma Hiring prompt. The names are suggestive of race, sampled from the list below.
  • ...and 10 more figures
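The subspace interchange intervention described in the Figure 1 caption, where the target's representation inside the subspace is replaced by the source's, might look like the following NumPy sketch. This is an illustration under stated assumptions rather than the paper's code: `interchange_intervention` is a hypothetical helper, the `basis` is a random orthonormal matrix instead of a subspace learned by distributed alignment search, and the activations are random vectors.

```python
import numpy as np

def interchange_intervention(target: np.ndarray, source: np.ndarray,
                             basis: np.ndarray) -> np.ndarray:
    """Overwrite the target's subspace component with the source's.

    target, source: (d_model,) activations (hypothetical shape).
    basis: (d_model, k) matrix whose orthonormal columns span the subspace.
    """
    # Project the difference onto the subspace and add it to the target, so
    # only the within-subspace part of the target changes.
    return target + basis @ (basis.T @ (source - target))

# Stand-in data: a random orthonormal basis instead of a learned subspace.
rng = np.random.default_rng(1)
d_model, k = 16, 3
basis, _ = np.linalg.qr(rng.normal(size=(d_model, k)))
target = rng.normal(size=d_model)
source = rng.normal(size=d_model)
patched = interchange_intervention(target, source, basis)
```

In the setting the caption describes, the patched activation would be fed back into the model, and the intervention counts as successful when the model's decision matches the counterfactual decision.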