FAIRGAMER: Evaluating Social Biases in LLM-Based Video Game NPCs

Bingkang Shi, Jen-tse Huang, Long Luo, Tianyu Zong, Hongzhu Yi, Yuanxiang Wang, Songlin Hu, Xiaodan Zhang, Zhongjiang Yao

TL;DR

FairGamer introduces the first benchmark to quantify social biases in LLM-driven game NPCs across three interaction patterns and four bias types, using a novel multivariate fairness metric (FairMCV). By mapping NPC interactions to game-theoretic settings (bargaining, allocation, and zero-sum competition) and assembling a large bilingual dataset, the approach reveals that bias is an intrinsic model property and can be amplified by larger models or difficult interaction regimes. Chain-of-Thought debiasing provides partial mitigation, but substantial bias persists, underscoring the need for robust debiasing or post-training interventions in game AI. The work offers a practical, data-driven framework for evaluating and improving NPC fairness, with implications for fair gameplay and user experience in diverse virtual worlds.

Abstract

Large Language Models (LLMs) have increasingly enhanced or replaced traditional Non-Player Characters (NPCs) in video games. However, these LLM-based NPCs inherit underlying social biases (e.g., race or class), posing fairness risks during in-game interactions. To address the limited exploration of this issue, we introduce FairGamer, the first benchmark to evaluate social biases across three interaction patterns: transaction, cooperation, and competition. FairGamer assesses four bias types (class, race, age, and nationality) across 12 distinct evaluation tasks using a novel metric, FairMCV. Our evaluation of seven frontier LLMs reveals that: (1) models exhibit biased decision-making, with Grok-4-Fast demonstrating the highest bias (average FairMCV = 76.9%); and (2) larger LLMs display more severe social biases, suggesting that increased model capacity inadvertently amplifies these biases. We release FairGamer at https://github.com/Anonymous999-xxx/FairGamer to facilitate future research on NPC fairness.

Paper Structure

This paper contains 28 sections, 6 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Illustration of our evaluation process. (a) Transaction, cooperation, and competition are three fundamental modes of interaction between an LLM and any NPC in a game. (b) After observing the identity information of itself and the interacting NPC, the LLM generates biased decisions during the interaction.
  • Figure 2: Overview of the FairGamer evaluation method. (A) Demographic Info Injection: Game rules and choices are defined based on the interaction mode, and socially-biased role attributes are assigned to both interacting parties (e.g., $\text{role}_\text{self}$="Barbarian" and $\text{role}_\text{obs}$="Bards"). (B) FairMCV Computation: The 1D/3D distribution of LLM outputs is obtained through repeated sampling, based on which the FairMCV score is calculated.
  • Figure 3: FairMCV provides a unified scalar measure for the dispersion of decision vector distributions, irrespective of their dimensions.
  • Figure 4: FairMCV Score of DeepSeek-V3.2 at different temperatures in FairGamer.
  • Figure 5: FairMCV Score of DeepSeek-V3.2 using different prompt templates in FairGamer.
  • ...and 11 more figures