Generative Language Models Exhibit Social Identity Biases

Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander van der Linden, Jon Roozenbeek

TL;DR

This work investigates whether ingroup solidarity and outgroup hostility, fundamental social identity biases known from social psychology, are present in 56 large language models and finds that almost all foundational language models and some instruction fine-tuned models exhibit clear ingroup-positive and outgroup-negative associations when prompted to complete sentences.

Abstract

The surge in popularity of large language models has given rise to concerns about biases that these models could learn from humans. We investigate whether ingroup solidarity and outgroup hostility, fundamental social identity biases known from social psychology, are present in 56 large language models. We find that almost all foundational language models and some instruction fine-tuned models exhibit clear ingroup-positive and outgroup-negative associations when prompted to complete sentences (e.g., "We are..."). Our findings suggest that modern language models exhibit fundamental social identity biases to a similar degree as humans, both in the lab and in real-world conversations with LLMs, and that curating training data and instruction fine-tuning can mitigate such biases. Our results have practical implications for creating less biased large language models and further underscore the need for more research into user interactions with LLMs to prevent potential bias reinforcement in humans.

Paper Structure (15 sections, 4 equations, 10 figures, 22 tables)

Figures (10)

  • Figure 1: Study 1: Ingroup sentences produced by base LLMs are about twice as likely to be positive (vs. negative or neutral) as outgroup sentences, while outgroup sentences are about twice as likely to be negative (controlling for sentence length and the number of unique words). a Social identity biases in base LLMs. b Models with exceptionally high levels of outgroup hostility. c Social identity biases in instruction fine-tuned LLMs with sentence samples produced by the instruction prompt. d Ingroup solidarity and outgroup hostility in human data obtained from four different pretraining corpora.
  • Figure 2: Study 2: Ingroup solidarity and outgroup hostility biases in language models fine-tuned on partisan social media data. a Both biases increase after fine-tuning models with US partisan Twitter data, but outgroup hostility increases more: outgroup sentences are almost seven times more likely to be negative than ingroup sentences. b The sentiment of ingroup and outgroup sentences generated by BLOOM 1.1B before (left) and after (right) fine-tuning with Republican Twitter data.
  • Figure 3: Study 2: a Ingroup solidarity and outgroup hostility measures for Republican and Democrat models after removing different proportions of positive and negative ingroup and outgroup sentences from training data. b Comparison of sentiment in ingroup and outgroup sentences generated by the GPT-2 base model, a model fine-tuned on Republican-affiliated Twitter data, and variants fine-tuned without either ingroup positive or outgroup negative sentences, or both.
  • Figure 4: Diagnostic values for stm models with different topic numbers.
  • Figure 5: Proportions of topics found by the stm model in the corpus generated by the non-finetuned models.
  • ...and 5 more figures