RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity

Jisu Shin, Hoyun Song, Juhyun Oh, Changgeon Ko, Eunsu Kim, Chani Jung, Alice Oh

TL;DR

RoleConflictBench introduces a three-stage pipeline that generates 13,914 realistic role-conflict stories across 65 roles, enabling evaluation of LLMs’ contextual sensitivity to social dilemmas. By querying models for binary role-priority decisions and analyzing their responses with a Sensitivity score $S$ and a Bradley–Terry-based Role Priority Index, the paper shows that current LLMs exhibit limited context sensitivity and strong biases toward roles in the Family and Occupation domains, along with notable gender and Abrahamic-religion biases. Situational urgency only weakly modulates decisions compared with ingrained role preferences, and demographic cues can disproportionately steer outputs, which indicates severe risks for real-world advisory and simulation tasks. The benchmark offers reproducible data, transparent prompts, and actionable metrics for diagnosing contextual sensitivity and social biases, informing safer and more equitable AI systems for decision-making.

Abstract

Humans often encounter role conflicts -- social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) become increasingly influential in human decision-making, understanding how they behave in complex social situations is essential. While previous research has evaluated LLMs' social abilities in contexts with predefined correct answers, role conflicts represent inherently ambiguous social dilemmas that require contextual sensitivity: the ability to recognize and appropriately weigh situational cues that can fundamentally alter decision priorities. To address this gap, we introduce RoleConflictBench, a novel benchmark designed to evaluate LLMs' contextual sensitivity in complex social dilemmas. Our benchmark employs a three-stage pipeline to generate over 13K realistic role conflict scenarios across 65 roles, systematically varying their associated expectations (i.e., their responsibilities and obligations) and situational urgency levels. By analyzing model choices across 10 different LLMs, we find that while LLMs show some capacity to respond to these contextual cues, this sensitivity is insufficient. Instead, their decisions are predominantly governed by a powerful, inherent bias related to social roles rather than situational information. Our analysis quantifies these biases, revealing a dominant preference for roles within the Family and Occupation domains, as well as a clear prioritization of male roles and Abrahamic religions across most evaluatee models.
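To make the Bradley–Terry idea behind the Role Priority Index more concrete, here is a minimal sketch of fitting Bradley–Terry strengths from binary role choices. It is an illustration under stated assumptions, not the paper's implementation: the function name `fit_bradley_terry`, the `(winner, loser)` input format, and the use of the standard minorization-maximization update are all choices made here, and this summary does not say how the paper turns fitted strengths into its index.

```python
from collections import defaultdict


def fit_bradley_terry(outcomes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper: `outcomes` lists one pair of distinct role names
    per model decision, with the chosen role first.
    """
    wins = defaultdict(int)    # role -> number of times it was chosen
    games = defaultdict(int)   # unordered role pair -> number of comparisons
    roles = set()
    for winner, loser in outcomes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        roles.update((winner, loser))

    # Start from equal strengths and apply the classic MM update:
    # p_i <- W_i / sum_j n_ij / (p_i + p_j), followed by normalization.
    strength = {role: 1.0 for role in roles}
    for _ in range(iterations):
        updated = {}
        for role in roles:
            denom = 0.0
            for pair, n in games.items():
                if role in pair:
                    (other,) = pair - {role}
                    denom += n / (strength[role] + strength[other])
            updated[role] = wins[role] / denom if denom > 0 else strength[role]
        total = sum(updated.values())
        strength = {role: s / total for role, s in updated.items()}
    return strength


# Toy data (made up for illustration): "parent" is chosen over "employee"
# in 3 of 4 scenarios.
toy = [("parent", "employee")] * 3 + [("employee", "parent")]
strengths = fit_bradley_terry(toy)
```

Under the Bradley–Terry model, the probability that role i is chosen over role j is p_i / (p_i + p_j), so sorting roles by fitted strength gives a ranking that accounts for which opponents each role faced rather than relying on raw win counts alone.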

Paper Structure

This paper contains 47 sections, 10 figures, and 12 tables.

Figures (10)

  • Figure 1: Conceptual illustration of RoleConflictBench. We generate distinct expectations for two competing social roles and synthesize them into a story depicting an individual's role conflict. Our benchmark is designed to evaluate how decisions change depending on the situation.
  • Figure 2: Story generation pipeline of RoleConflictBench. An LLM serves as a generator to synthesize a first-person story depicting a role conflict.
  • Figure 3: Win ratio of each role, conditioned on its urgency level relative to its opponent. The lines show the win ratio when a role's urgency level is higher (●), equal (▲), or lower ($\times$) than its opponent's. Roles on the x-axis are sorted by their overall role priority index.
  • Figure 5: Value statistics cited in the reasoning paths of GPT-4.1 for justifying its role preferences across different social domains. The results show associations between specific roles and values.
  • Figure 6: Rankings ordered by role priority index.
  • ...and 5 more figures
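
The conditioning described in the Figure 3 caption (a role's win ratio when its urgency is higher than, equal to, or lower than its opponent's) can be sketched as follows. This is a hedged illustration only: the record layout `(role_a, urgency_a, role_b, urgency_b, chosen)`, the numeric urgency levels, and the function name are assumptions introduced here, not details taken from the paper.

```python
from collections import defaultdict


def win_ratio_by_relative_urgency(decisions):
    """Compute each role's win ratio per relative-urgency condition.

    Hypothetical input: an iterable of
    (role_a, urgency_a, role_b, urgency_b, chosen) tuples, where urgency
    values are comparable numbers and `chosen` is one of the two roles.
    Returns {(role, relation): win_ratio} with relation in
    {"higher", "equal", "lower"}.
    """
    counts = defaultdict(lambda: [0, 0])  # (role, relation) -> [wins, total]
    for role_a, urg_a, role_b, urg_b, chosen in decisions:
        # Record the comparison once from each role's point of view.
        for role, own, other in ((role_a, urg_a, urg_b), (role_b, urg_b, urg_a)):
            if own > other:
                relation = "higher"
            elif own < other:
                relation = "lower"
            else:
                relation = "equal"
            counts[(role, relation)][1] += 1
            if chosen == role:
                counts[(role, relation)][0] += 1
    return {key: w / t for key, (w, t) in counts.items()}


# Toy decisions (made up for illustration).
toy = [
    ("parent", 3, "employee", 1, "parent"),
    ("parent", 1, "employee", 3, "parent"),
    ("parent", 2, "employee", 2, "employee"),
]
ratios = win_ratio_by_relative_urgency(toy)
```

In this toy data, "parent" wins even when its urgency is lower than its opponent's. That is the kind of pattern the abstract describes as role bias outweighing situational cues.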