Evaluating and Mitigating Discrimination in Language Model Decisions

Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli

TL;DR

The paper presents a proactive, scalable method to assess discrimination risk in language-model decisions by generating 70 diverse hypothetical decision prompts and varying demographics explicitly or via names. It demonstrates that Claude 2.0 exhibits both positive and negative discrimination prior to mitigation, with explicit demographic signals producing stronger effects than implicit ones. The authors show that prompt-based mitigations, including anti-discrimination prompts and thinking-aloud requests, can dramatically reduce discrimination while preserving decision relevance, validated through human checks and a mixed-effects model. They release their dataset and prompts to enable broader auditing by developers and policymakers, emphasizing cautious, sociotechnical deployment in real-world high-stakes scenarios.

Abstract

As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval
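The step of systematically varying demographic information can be sketched as filling a decision template over a grid of attribute values. This is a minimal illustration rather than the authors' code: the template wording, the placeholder names, and the attribute values below are all assumptions made for the example and are not copied from the released dataset.

```python
from itertools import product

# Hypothetical template; the placeholder names and wording are illustrative,
# not taken from the released dataset.
TEMPLATE = (
    "The applicant is a {age}-year-old {race} {gender} applying for a small "
    "business loan. Should the loan be approved?"
)

# Illustrative attribute values (assumed for this sketch).
AGES = [20, 40, 60, 80]
RACES = ["white", "Black", "Asian"]
GENDERS = ["male", "female"]


def fill_template(template, ages, races, genders):
    """Return one record per demographic combination, each with a filled prompt."""
    records = []
    for age, race, gender in product(ages, races, genders):
        records.append({
            "age": age,
            "race": race,
            "gender": gender,
            "prompt": template.format(age=age, race=race, gender=gender),
        })
    return records


prompts = fill_template(TEMPLATE, AGES, RACES, GENDERS)
print(len(prompts))          # number of demographic combinations
print(prompts[0]["prompt"])  # first filled prompt
```

Because every prompt comes from the same template, any difference in the model's answers across records can be attributed to the demographic fields rather than to the wording of the question.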

Paper Structure (42 sections, 1 equation, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of our method for measuring discrimination in language model decisions. We first generate decision topics (e.g., "insurance decisions") and then generate full questions a decision-maker might ask a model about that topic, with placeholders for age, race, and gender. We ensure a "yes" response is a positive outcome for the subject of the question. We then fill those placeholders with different values and evaluate whether the LM's probability of "yes" is significantly higher for some demographics compared to others. See the prompts appendix for the full set of prompts we use.
  • Figure 2: Patterns of positive and negative discrimination in Claude. Discrimination score for different demographic attributes and ways of populating the templates with those attributes (see the sections on question generation and the discrimination score). We broadly see positive discrimination by race and gender relative to a white male baseline, and negative discrimination for age groups over 60 compared to those under 60. Discrimination is higher for explicit demographic attributes (e.g., "Black male") and lower but still positive for names (e.g., "Jalen Washington").
  • Figure 3: Patterns of discrimination are mostly similar across decision questions. Discrimination scores (see the section on the discrimination score) for different decision questions (e.g., granting a visa, providing security clearance) and demographics (age and Black, relative to the white 60-year-old baseline). Without intervention, the model typically exhibits neutral or negative discrimination with respect to age, while exhibiting positive discrimination for Black over white candidates for these decision questions. Results shown here are for prompts filled with Explicit demographic attributes (see the section on question generation).
  • Figure 4: The style in which the decision question is written does not affect the direction of discrimination across templates. However, the amount of discrimination is sometimes larger for specific styles. For example, the magnitude of the discrimination score is generally larger when the prompts are written in an emotional style (see the emotional-style prompt).
  • Figure 5: Prompt-based interventions can significantly reduce the discrimination score. We consider a wide range of interventions for mitigating discrimination, including appending text to prompts and asking the model to verbalize its decision-making process in an unbiased way. Several of these interventions reduce the discrimination score to nearly zero across demographics.
  • ...and 1 more figure
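The comparison described in the Figure 1 and Figure 2 captions, whether the probability of "yes" is higher for one demographic than for a baseline, can be illustrated with a simple log-odds gap. This is a deliberately simplified stand-in: the TL;DR notes that the paper uses a mixed-effects model, which this sketch does not implement. The function names and the probabilities below are invented for illustration and are not results from the paper.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def mean_logit_gap(p_yes_group, p_yes_baseline):
    """Average logit(p_yes) for a group minus the baseline average.

    Positive values mean the group receives "yes" more readily than the
    baseline; negative values mean less readily.
    """
    group = sum(logit(p) for p in p_yes_group) / len(p_yes_group)
    base = sum(logit(p) for p in p_yes_baseline) / len(p_yes_baseline)
    return group - base


# Toy probabilities of "yes" on three decision questions, invented for
# illustration only (not results from the paper).
baseline = [0.70, 0.60, 0.80]
group_a = [0.75, 0.66, 0.84]
group_b = [0.62, 0.55, 0.71]

print(round(mean_logit_gap(group_a, baseline), 3))  # positive gap
print(round(mean_logit_gap(group_b, baseline), 3))  # negative gap
```

Working in log-odds rather than raw probabilities keeps the gap comparable between questions where the baseline probability of "yes" is very high and questions where it is close to one half.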