TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models

Jaewoo Ahn; Taehyun Lee; Junyoung Lim; Jin-Hwa Kim; Sangdoo Yun; Hwaran Lee; Gunhee Kim

TL;DR

TimeChara introduces a large-scale benchmark to probe point-in-time character hallucination in role-playing LLMs, emphasizing spatiotemporal self-consistency. It provides an automated pipeline to generate 10,895 interview-style instances across 14 fictional characters from four popular series, paired with explicit spatiotemporal labels. The authors show significant hallucinations in current models and propose Narrative-Experts, a decomposed reasoning approach with temporal and spatial specialists, to mitigate errors. Across multiple backbone LLMs and evaluation setups, Narrative-Experts improves performance, but the study highlights persistent challenges in maintaining character knowledge boundaries over time, motivating further research in this area.

Abstract

While Large Language Models (LLMs) can serve as agents to simulate human behaviors (i.e., role-playing agents), we emphasize the importance of point-in-time role-playing. This situates characters at specific moments in the narrative progression for three main reasons: (i) enhancing users' narrative immersion, (ii) avoiding spoilers, and (iii) fostering engagement in fandom role-playing. To accurately represent characters at specific time points, agents must avoid character hallucination, where they display knowledge that contradicts their characters' identities and historical timelines. We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs. Comprising 10,895 instances generated through an automated pipeline, this benchmark reveals significant hallucination issues in current state-of-the-art LLMs (e.g., GPT-4o). To counter this challenge, we propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively. Still, our findings with TimeChara highlight the ongoing challenges of point-in-time character hallucination, calling for further study.
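To make the notion of point-in-time character hallucination concrete, the sketch below models a character pinned to a moment in a narrative and flags responses that mention events from that character's future. It is a minimal illustration only, not the authors' pipeline or evaluation method (the paper uses LLM judges with spatiotemporal labels). The names `Event`, `PointInTimeInstance`, and `future_events_mentioned`, the integer timeline encoding, and the keyword matching are all hypothetical assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """A narrative event with a position on a (hypothetical) integer timeline."""
    name: str
    time_index: int
    keywords: Tuple[str, ...]  # phrases that would reveal knowledge of the event


@dataclass(frozen=True)
class PointInTimeInstance:
    """A character situated at a specific point in the narrative (assumed schema)."""
    character: str
    time_index: int
    question: str


def future_events_mentioned(
    instance: PointInTimeInstance,
    response: str,
    timeline: List[Event],
) -> List[str]:
    """Return names of events after the character's time point that the response mentions.

    Keyword matching is a crude stand-in for a real judge; it only illustrates
    the idea of checking a response against the character's knowledge boundary.
    """
    text = response.lower()
    return [
        event.name
        for event in timeline
        if event.time_index > instance.time_index
        and any(keyword.lower() in text for keyword in event.keywords)
    ]


# Toy timeline; the indices are illustrative, not taken from the benchmark.
timeline = [
    Event("arrival at Hogwarts", 1, ("sorting hat",)),
    Event("marriage to Ginny Weasley", 8, ("married", "my wife")),
]
fifth_year_harry = PointInTimeInstance(
    character="Harry Potter",
    time_index=5,
    question="Who are you closest to?",
)

hallucinated = future_events_mentioned(
    fifth_year_harry, "I am married to Ginny.", timeline
)
consistent = future_events_mentioned(
    fifth_year_harry, "Ron and Hermione are my best friends.", timeline
)
```

In this toy setting, a non-empty result signals that the response leaks knowledge from beyond the character's time point, which mirrors the failure case illustrated in the paper's Figure 1.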

Paper Structure (51 sections, 7 figures, 29 tables, 1 algorithm)

Introduction
Related Work
Role-playing LLM agents.
LLM's temporal reasoning capability.
The TimeChara Benchmark
Fact-Based Interview
Fake-Based Interview
Evaluation on TimeChara
Step-by-step evaluation with spatiotemporal labels.
Dataset Construction
Dataset Analyses
Decomposed Reasoning
Experiments on TimeChara
Dataset Sampling for Evaluation
Baseline Methods
...and 36 more sections

Figures (7)

Figure 1: An illustration of point-in-time character hallucination by a role-playing agent simulating Harry Potter. (Top) The agent, simulating Harry Potter at 37 years old, responds consistently to the user's queries. (Bottom) The agent, simulating Harry Potter in his fifth year at Hogwarts, erroneously mentions his marriage to Ginny Weasley, an event that occurs after his fifth year.
Figure 2: Evaluation accuracy of LLM judges for spatiotemporal consistency. For both GPT-4 and GPT-3.5, judges given spatiotemporal labels outperform those without. We randomly select 300 data instances containing responses generated by GPT-4 Turbo (see Table \ref{tab:main_experiment}) and manually annotate them with binary labels indicating whether spatiotemporal consistency holds. We report the evaluation accuracy of LLM judges relative to humans (set to 100). 'Total' denotes the average score across all cases.
Figure 3: An illustration of our automated pipeline for constructing TimeChara. See Table \ref{tab:fact_structured_future_example} and Appendix \ref{sec:examples_of_timechara} for examples of the complete dataset.
Figure 4: A nested pie chart of verb-noun structures in free-form questions, encompassing both fact-based and fake-based.
Figure 5: A nested pie chart of verb-noun structures in structured questions.
...and 2 more figures
