The Unequal Opportunities of Large Language Models: Revealing Demographic Bias through Job Recommendations

Abel Salinas, Parth Vipul Shah, Yuzhong Huang, Robert McCormack, Fred Morstatter

TL;DR

This work examines how large language models can propagate demographic biases into downstream decisions by analyzing job recommendations. It introduces a template-based framework to induce and measure bias across nationality and gender identity, applied to ChatGPT and LLaMA with 50 generations per prompt and BERTopic clustering to organize the resulting job titles. The analysis reveals model- and prompt-dependent biases, including pronounced effects for Mexican nationality and substantial gender-nationality interactions, alongside salary patterns that often mirror but occasionally diverge from real-world labor data. The findings underscore the need for careful prompt design and bias mitigation to prevent discriminatory outcomes in practical NLP applications, and motivate the development of robust benchmarks beyond brittle template-based tests.

Abstract

Large Language Models (LLMs) have seen widespread deployment in various real-world applications. Understanding the biases of these models is crucial to comprehending the potential downstream consequences of using LLMs to make decisions, particularly for historically disadvantaged groups. In this work, we propose a simple method for analyzing and comparing demographic bias in LLMs through the lens of job recommendations. We demonstrate the effectiveness of our method by measuring intersectional biases within ChatGPT and LLaMA, two cutting-edge LLMs. Our experiments primarily focus on uncovering gender identity and nationality bias; however, our method can be extended to examine biases associated with any intersection of demographic identities. We identify distinct biases in both models toward various demographic identities: for example, both models consistently suggest low-paying jobs for Mexican workers and prefer to recommend secretarial roles to women. Our study highlights the importance of measuring the bias of LLMs in downstream applications to understand the potential for harm and inequitable outcomes.

Paper Structure (29 sections, 2 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Geographical Distribution of 20 Nationalities Recommended by ChatGPT, to be analyzed within our study. Regional preferences are demonstrated in the countries selected by ChatGPT.
  • Figure 2: Visualization of the embedding space, in two dimensions using dimensionality reduction, showing the embeddings of all unique job titles returned by ChatGPT and LLaMA across three semantically-similar prompts. We cluster the embeddings and color each unique job title with its corresponding cluster's color.
  • Figure 3: Word cloud visualization of all job titles returned by ChatGPT and LLaMA for three semantically-similar prompts. Word size corresponds to the frequency of that word being suggested by the model. Color corresponds to the probability of that word being offered to a man versus a woman (blue skews male, gold skews female).
  • Figure 4: Probabilities of each job type being offered, given each of our three prompts. These probabilities are computed from over 2000 generations, with varying combinations of nationality and gender identity.
  • Figure 5: Differences in the probability of a given job type to be offered to men versus women. We show these differences across each prompt for both ChatGPT and LLaMA. The male and female probabilities are computed from 1000 generations each, with varying combinations of nationality and gender identity.
  • ...and 3 more figures
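
The quantities described in the captions for Figures 4 and 5 (the probability of each job type being offered, and the male-versus-female difference in that probability) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the job-type labels, and the toy data below are all hypothetical, and the sketch assumes each generation has already been mapped to a single job-type label (for example, by a clustering step).

```python
from collections import Counter


def job_type_probabilities(job_types):
    """Return each job type's share of the given generations."""
    counts = Counter(job_types)
    total = sum(counts.values())
    return {job: n / total for job, n in counts.items()}


def probability_gap(male_job_types, female_job_types):
    """Return P(job type | man) - P(job type | woman) for every job type seen."""
    p_male = job_type_probabilities(male_job_types)
    p_female = job_type_probabilities(female_job_types)
    all_types = sorted(set(p_male) | set(p_female))
    return {
        job: p_male.get(job, 0.0) - p_female.get(job, 0.0)
        for job in all_types
    }


# Made-up labels for illustration only; these are not results from the paper.
male = ["engineering", "engineering", "driver", "sales"]
female = ["secretarial", "engineering", "sales", "sales"]

gaps = probability_gap(male, female)
for job, gap in gaps.items():
    # Positive values skew male, negative values skew female.
    print(f"{job:12s} {gap:+.2f}")
```

Because both distributions sum to one, the gaps across all job types sum to zero, which is a quick sanity check when applying this kind of comparison to real generations.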