Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives

Runcong Zhao; Qinglin Zhu; Hainiu Xu; Jiazheng Li; Yuxiang Zhou; Yulan He; Lin Gui

TL;DR

This work introduces Conan, a benchmark and dataset to study how large language models comprehend complex, multi-perspective character relationships in detective narratives. It defines a task framework with Character Extraction, Entity Linking, and Relation Deduction, and provides a hierarchical taxonomy (5 top-level, 54 intermediate, 163 detailed) to evaluate nuanced social connections, including public, secret, and inferred relations. Through experiments with GPT-3.5, GPT-4, and Llama2, the authors demonstrate that current models struggle with long narratives and conflicting perspectives, and they analyze three relation-detection strategies (AllTogether, DirRelation, PairRelation) plus ablation studies on character extraction quality. The findings highlight the need for improved inferential reasoning and information-management approaches (e.g., retrieval augmentation, chain-of-thought) to advance narrative understanding, with implications for creative writing analysis, interactive agents, and theory-of-mind research.

Abstract

Existing datasets for narrative understanding often fail to represent the complexity and uncertainty of relationships in real-life social scenarios. To address this gap, we introduce a new benchmark, Conan, designed for extracting and analysing intricate character relation graphs from detective narratives. Specifically, we designed hierarchical relationship categories and manually extracted and annotated role-oriented relationships from the perspectives of various characters, incorporating both public relationships known to most characters and secret ones known to only a few. Our experiments with advanced Large Language Models (LLMs) such as GPT-3.5, GPT-4, and Llama2 reveal their limitations in inferring complex relationships and handling longer narratives. The combination of the Conan dataset and our pipeline strategy is geared towards understanding the ability of LLMs to comprehend nuanced relational dynamics in narrative contexts.

Paper Structure (56 sections, 4 figures, 13 tables)


Introduction
Task Definition
Dataset Construction
Data Collection and Processing
Collection
Filtering
Data Annotation and Evaluation
Relation Category Construction
Labelling
Inter-annotator Agreement
Data Statistics
Experiments
Baselines
Corruption Rate
Character Extraction
...and 41 more sections

Figures (4)

Figure 1: The example illustrates complex relationships of characters in narratives. Gray-colored relationships represent surface-level information, widely known to most characters. Orange-colored relationships, on the other hand, are secrets known to only one or very few individuals, often conflicting with the commonly known relationships; these are referred to as secret relationships. Red-colored relationships represent inferred information, meaning they are not explicitly stated in any character's story but can be deduced by synthesising information from all characters collectively. LLMs struggle with such complex relationships in long narratives.
Figure 2: Input-Output Format and Benchmark Relation Detection Strategies. The input narrative consists of $k$ background stories $N_{c_i}$, each written from the perspective of a character $c_i$. The objective is to extract all characters from the given story, including those beyond the initial $k$ characters; then to detect the relationships among all extracted characters, even when they involve false or multiple identities; and finally to uncover conflicting relationships in order to deduce the genuine nature of these relationships.
Figure 3: Dataset Construction.
Figure 4: F1-score against the length of given narrative.
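
As a rough illustration of how an F1-score over extracted relations could be computed, the sketch below scores a set of predicted relation triples against a gold set by exact match. This is only an assumption about the scoring setup: the triple layout (source character, target character, relation label), the function name relation_f1, and the example characters and labels are all hypothetical and are not taken from the paper, whose exact matching rules may differ.

```python
from typing import Set, Tuple

# Hypothetical triple layout: (source character, target character, relation label).
# Treating relations as directed and exact-match is an assumption of this sketch.
Triple = Tuple[str, str, str]


def relation_f1(gold: Set[Triple], pred: Set[Triple]) -> float:
    """Micro F1 over exact-match relation triples."""
    if not gold or not pred:
        return 0.0
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: one of two predictions matches one of three gold triples.
gold = {
    ("Alice", "Bob", "sibling"),
    ("Bob", "Carol", "colleague"),
    ("Alice", "Carol", "friend"),
}
pred = {
    ("Alice", "Bob", "sibling"),
    ("Bob", "Carol", "rival"),
}
score = relation_f1(gold, pred)
```

In this made-up example precision is 1/2 and recall is 1/3, so the score is 0.4. A hierarchical taxonomy such as the one described above could also support partial credit at coarser levels, which this exact-match sketch does not attempt.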
