Table of Contents
Fetching ...

How Are LLMs Mitigating Stereotyping Harms? Learning from Search Engine Studies

Alina Leidinger, Richard Rogers

TL;DR

This work scrutinizes stereotyping harms in open-ended LLM generation using an autocomplete-style benchmark inspired by search engine studies. By probing seven instruction-tuned models with prompts across 170+ social groups and evaluating with four metrics (refusal, toxicity, sentiment, regard), the study reveals that safety prompts provide partial mitigation but fail to comprehensively address stereotyping, especially for ethnicity, sexuality, and intersectional identities. Inter-model differences are pronounced: some models (e.g., Llama-2, Starling) show stronger refusals and more favorable sentiment/regard, while others (notably Falcon) exhibit higher toxicity and weaker moderation. Additionally, forcing autocompletion formatting without chat templates often amplifies toxic stereotyping, underscoring the fragility of current safety regimes when integrated into search-like contexts. The paper argues for accountability and diverse social-impact measures in leaderboards and auditing, highlighting the need for explicit handling of intersectional harms and transparent safety-training practices to better align AI systems with social welfare goals.

Abstract

With the widespread availability of LLMs since the release of ChatGPT and increased public scrutiny, commercial model development appears to have focused their efforts on 'safety' training concerning legal liabilities at the expense of social impact evaluation. This mimics a similar trend which we could observe for search engine autocompletion some years prior. We draw on scholarship from NLP and search engine auditing and present a novel evaluation task in the style of autocompletion prompts to assess stereotyping in LLMs. We assess LLMs by using four metrics, namely refusal rates, toxicity, sentiment and regard, with and without safety system prompts. Our findings indicate an improvement to stereotyping outputs with the system prompt, but overall a lack of attention by LLMs under study to certain harms classified as toxic, particularly for prompts about peoples/ethnicities and sexual orientation. Mentions of intersectional identities trigger a disproportionate amount of stereotyping. Finally, we discuss the implications of these findings about stereotyping harms in light of the coming intermingling of LLMs and search and the choice of stereotyping mitigation policy to adopt. We address model builders, academics, NLP practitioners and policy makers, calling for accountability and awareness concerning stereotyping harms, be it for training data curation, leader board design and usage, or social impact measurement.

How Are LLMs Mitigating Stereotyping Harms? Learning from Search Engine Studies

TL;DR

This work scrutinizes stereotyping harms in open-ended LLM generation using an autocomplete-style benchmark inspired by search engine studies. By probing seven instruction-tuned models with prompts across 170+ social groups and evaluating with four metrics (refusal, toxicity, sentiment, regard), the study reveals that safety prompts provide partial mitigation but fail to comprehensively address stereotyping, especially for ethnicity, sexuality, and intersectional identities. Inter-model differences are pronounced: some models (e.g., Llama-2, Starling) show stronger refusals and more favorable sentiment/regard, while others (notably Falcon) exhibit higher toxicity and weaker moderation. Additionally, forcing autocompletion formatting without chat templates often amplifies toxic stereotyping, underscoring the fragility of current safety regimes when integrated into search-like contexts. The paper argues for accountability and diverse social-impact measures in leaderboards and auditing, highlighting the need for explicit handling of intersectional harms and transparent safety-training practices to better align AI systems with social welfare goals.

Abstract

With the widespread availability of LLMs since the release of ChatGPT and increased public scrutiny, commercial model development appears to have focused their efforts on 'safety' training concerning legal liabilities at the expense of social impact evaluation. This mimics a similar trend which we could observe for search engine autocompletion some years prior. We draw on scholarship from NLP and search engine auditing and present a novel evaluation task in the style of autocompletion prompts to assess stereotyping in LLMs. We assess LLMs by using four metrics, namely refusal rates, toxicity, sentiment and regard, with and without safety system prompts. Our findings indicate an improvement to stereotyping outputs with the system prompt, but overall a lack of attention by LLMs under study to certain harms classified as toxic, particularly for prompts about peoples/ethnicities and sexual orientation. Mentions of intersectional identities trigger a disproportionate amount of stereotyping. Finally, we discuss the implications of these findings about stereotyping harms in light of the coming intermingling of LLMs and search and the choice of stereotyping mitigation policy to adopt. We address model builders, academics, NLP practitioners and policy makers, calling for accountability and awareness concerning stereotyping harms, be it for training data curation, leader board design and usage, or social impact measurement.
Paper Structure (52 sections, 15 figures, 8 tables)

This paper contains 52 sections, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Average refusal rates (rule-based classifier)
  • Figure 2: Sentiment scores per category with chat template
  • Figure 3: Regard scores per category with chat template
  • Figure 4: Average refusal rates (rule-based classifier) for male/female genders, peoples/ethnicities, and intersections
  • Figure 5: Average refusal rates per category with and without system prompt
  • ...and 10 more figures