Us-vs-Them bias in Large Language Models

Tabia Tanzin Prama, Julia Witte Zimmerman, Christopher M. Danforth, Peter Sheridan Dodds

TL;DR

The paper investigates how 'us versus them' bias, grounded in Social Identity Theory, manifests in large language models and how persona cues modulate ingroup solidarity and outgroup hostility. It introduces a three-pronged methodology using sentiment dynamics, allotaxonometry, and embedding regression to quantify bias across architectures, then demonstrates substantial mitigation with the Ingroup-Outgroup Neutralization (ION) framework that combines fine-tuning and Direct Preference Optimization. The work reveals that persona conditioning yields measurable shifts in embedding space and semantic meaning of pronouns like 'we' and 'they', and shows that bias can be reduced while maintaining linguistic richness. These findings highlight the interplay between local context, model representations, and global cognitive tendencies, with practical implications for bias evaluation and targeted mitigation in future LLM development.

Abstract

This study investigates "us versus them" bias, as described by Social Identity Theory, in large language models (LLMs) under both default and persona-conditioned settings across multiple architectures (GPT-4.1, DeepSeek-3.1, Gemma-2.0, Grok-3.0, and LLaMA-3.1). Using sentiment dynamics, allotaxonometry, and embedding regression, we find consistent ingroup-positive and outgroup-negative associations across foundational LLMs. We find that adopting a persona systematically alters models' evaluative and affiliative language patterns. For the exemplar personas examined, conservative personas exhibit greater outgroup hostility, whereas liberal personas display stronger ingroup solidarity. Persona conditioning produces distinct clustering in embedding space and measurable semantic divergence, supporting the view that even abstract identity cues can shift models' linguistic behavior. Furthermore, outgroup-targeted prompts increased hostility bias by 1.19–21.76% across models. These findings suggest that LLMs learn not only factual associations about social groups but also internalize and reproduce distinct ways of being, including attitudes, worldviews, and cognitive styles that are activated when enacting personas. We interpret these results as evidence of a multi-scale coupling between local context (e.g., the persona prompt), localizable representations (what the model "knows"), and global cognitive tendencies (how it "thinks"), which are at least reflected in the training data. Finally, we demonstrate ION, an "us versus them" bias mitigation approach using fine-tuning and direct preference optimization (DPO), which reduces sentiment divergence by up to 69%, highlighting the potential for targeted mitigation strategies in future LLM development.

Paper Structure

This paper contains 32 sections, 10 equations, 20 figures, 16 tables, 1 algorithm.

Figures (20)

  • Figure 2: Allotaxonograph using rank-turbulence divergence (RTD) to compare ingroup ('We are') and outgroup ('They are') sentences generated by GPT-4o. The left panel shows the rank-rank histogram, while the right panel shows the rank-turbulence divergence graph, which visualizes the differences in word usage between the two phases. The allotaxonograph compares ranked lists of types for two systems, $\omega_1$ and $\omega_2$, by first generating a merged list of types covering both systems and then binning the logarithmic rank-rank pairs $(\log_{10} r_{\tau,1}, \log_{10} r_{\tau,2})$ across all types, uniformly in logarithmic space [dodds2023allotaxonometry]. The discrete, separated lines of boxes nearest to each bottom axis comprise words that appear in only one phase: 'exclusive types'. Moving up the histogram, the two other distinct lines above the 'exclusive-type lines' correspond to words that appear once and twice in the other phase. The three horizontal bars in the lower right show system balances. The top bar indicates the balance of total word counts (tokens) for each phase: 33% versus 67%. The middle bar shows the percentage of the combined lexicon (types) for the two phases that appears in each phase: 41% versus 83%. The bottom bar shows the percentage of words (types) in each phase that are exclusive: 42% and 71%. The rank-turbulence divergence graph on the right is based on the contribution of each word to the divergence measure. Words are listed in descending order of their divergence contribution, arranged left and right, and colored gray and blue according to the phase in which they are more prevalent. Ranks for each word in both phases are shown: for example, $r_{\text{believers},1} = 30$ and $r_{\text{believers},2} = 5924.5$. The allotaxonograph limits the resolution of the divergence parameter $\alpha$ to multiples of $1/12$; here, $\alpha = 1/3$.
  • Figure 3: Word shift graph comparing the happiness of ingroup and outgroup sentences based on word frequencies. Words are ranked by their percentage contribution to the change in average happiness, $\Phi_{\text{avg}}$. The ingroup ('We are') sentences are set as the reference text $T_\textnormal{ref}$, with the respective outgroup ('They are') sentences as the comparison text $T_\textnormal{comp}$. Individual word contributions to the shift are indicated by two symbols: $+/-$ shows the word is more/less prevalent in $T_\textnormal{comp}$ than in $T_\textnormal{ref}$. Black and gray fonts encode the $+$ and $-$ distinctions, respectively. The left inset panel shows how the ranked 3,686 labMT 1.0 words combine (word rank $r$ is shown on a log scale). The four bars at the top indicate the total contribution of the four types of words $( + \uparrow, + \downarrow, - \uparrow, -\downarrow)$. Relative text size is represented by the areas of the gray squares [Gallagher2020GeneralizedWS].
  • Figure 4: Ingroup solidarity ($\mu_{\text{ingroup}}$) and outgroup hostility ($\mu_{\text{outgroup}}$) across LLMs prompted with two partisan personas (Conservative and Liberal) in the U.S. political context.
  • Figure 5: Estimated significant coefficients of ingroup bias (the higher the coefficient, the higher the ingroup solidarity) and outgroup bias (the higher the coefficient, the higher the outgroup hostility) per LLM. We estimate these changes with the respective regression coefficients for three personas (Conservative, Liberal, and Default), where the ingroup bias score is the 'us versus them' score of 'We are' sentences and the outgroup bias score is that of 'They are' sentences. All estimated bias coefficients are significant (p-values below 0.001). Differences in the number of sentences per LLM do not affect these results, since the estimates are computed per sentence.
  • Figure 6: Differences in word meaning across pairwise persona comparisons: C vs L (Conservative vs Liberal), C vs D (Conservative vs Default), and L vs D (Liberal vs Default). Reported values correspond to the norm of $\beta$ coefficients with bootstrap confidence intervals. *** indicates statistical significance at the 0.01 level.
  • ...and 15 more figures
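
To make the rank-turbulence divergence in the Figure 2 caption more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the per-type contribution form $\frac{\alpha+1}{\alpha}\,\bigl|r_{\tau,1}^{-\alpha} - r_{\tau,2}^{-\alpha}\bigr|^{1/(\alpha+1)}$ from the allotaxonometry literature, omits the normalization constant (which rescales every contribution by the same factor and so does not change the word ordering), and gives tied types the mean of their rank positions. Types absent from one system are treated as tied for last place there, which is how a fractional rank such as 5924.5 can arise. The function names and toy sentences are hypothetical.

```python
from collections import Counter


def fractional_ranks(counts, vocab):
    """Rank types by descending count; tied types share the mean of their positions.

    Types in `vocab` that are missing from `counts` have count 0 and are
    therefore tied for last place.
    """
    ordered = sorted(vocab, key=lambda t: -counts.get(t, 0))
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover the whole run of types with the same count.
        while j + 1 < len(ordered) and counts.get(ordered[j + 1], 0) == counts.get(ordered[i], 0):
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2  # positions are 1-indexed
        for t in ordered[i : j + 1]:
            ranks[t] = mean_rank
        i = j + 1
    return ranks


def rtd_contributions(tokens_1, tokens_2, alpha=1 / 3):
    """Unnormalized per-type rank-turbulence divergence contributions (alpha > 0)."""
    c1, c2 = Counter(tokens_1), Counter(tokens_2)
    vocab = set(c1) | set(c2)  # merged list of types covering both systems
    r1 = fractional_ranks(c1, vocab)
    r2 = fractional_ranks(c2, vocab)
    prefactor = (alpha + 1) / alpha
    contrib = {
        t: prefactor * abs(r1[t] ** -alpha - r2[t] ** -alpha) ** (1 / (alpha + 1))
        for t in vocab
    }
    return r1, r2, contrib


# Toy stand-ins for 'We are' and 'They are' generations (assumed data).
ingroup = "we are strong we are united we are believers".split()
outgroup = "they are weak they are lost they are wrong".split()
r1, r2, contrib = rtd_contributions(ingroup, outgroup)
top_words = sorted(contrib, key=contrib.get, reverse=True)
```

Sorting `contrib` in descending order gives the kind of ordered word list shown in the right panel of an allotaxonograph. Comparing `r1[t]` with `r2[t]` indicates which side of the plot a word would fall on.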