MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments

Yin Cai; Zhouhong Gu; Zhaohan Du; Zheyu Ye; Shaosheng Cao; Yiqian Xu; Hongwei Feng; Ping Chen

TL;DR

MIRAGE introduces a comprehensive framework to evaluate large language models (LLMs) in complex social interactive environments using eight murder-mystery scripts. It defines four objective metrics—$TII$, $CIC$, $ICI$, and $SCI$—to quantify trust dynamics, information gathering, interactive capability, and script compliance, respectively. Across both open-source and proprietary models, experiments reveal persistent challenges for LLMs in nuanced social reasoning and show unequal strengths across models, with findings influenced by context length and evaluation bias. The framework provides a publicly available dataset and code, enabling standardized comparisons and paving the way for improved assessment of LLM social decision-making in narrative-driven settings.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in environmental perception, reasoning-based decision-making, and simulating complex human behaviors, particularly in interactive role-playing contexts. This paper introduces the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE), a comprehensive framework designed to assess LLMs' proficiency in portraying advanced human behaviors through murder mystery games. MIRAGE features eight intricately crafted scripts encompassing diverse themes and styles, providing a rich simulation. To evaluate LLMs' performance, MIRAGE employs four distinct methods: the Trust Inclination Index (TII) to measure dynamics of trust and suspicion, the Clue Investigation Capability (CIC) to measure LLMs' capability of gathering information, the Interactivity Capability Index (ICI) to assess role-playing capabilities, and the Script Compliance Index (SCI) to assess LLMs' capability of understanding and following instructions. Our experiments indicate that even popular models like GPT-4 face significant challenges in navigating the complexities presented by MIRAGE. The datasets and simulation code are available at https://github.com/lime728/MIRAGE.

Paper Structure (22 sections, 2 equations, 8 figures, 29 tables)

Introduction
MIRAGE Construction
Scripts Construction
Simulation Construction
Auxiliary Modules
Evaluation Methods
Statistics
Experiment
Experiment Setup
Analysis
Conclusion
Ablation Study
Computational methods of Evaluation methods
TII
CIC
...and 7 more sections

Figures (8)

Figure 1: The three main phases of MIRAGE and the main components of each phase.
Figure 2: CIC of Clues and Key Clues on 100 Rounds of MIRAGE using Qwen-2-7B
Figure 3: ICI of Single & Multi Type Scripts
Figure 4: SCI of Single & Multi Type Scripts
Figure 5: ICI of Orthodox & Unorthodox Type Scripts
...and 3 more figures
