Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Hung-Ting Su; Ya-Ching Hsu; Xudong Lin; Xiang-Qian Shi; Yulei Niu; Han-Yuan Hsu; Hung-yi Lee; Winston H. Hsu

TL;DR

This study uses tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and introduces a trope-wise querying approach that boosts the F1 score by 11.8 points.

Abstract

Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4's performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.
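To make the idea of trope-wise querying concrete, the following is a minimal sketch, not the authors' implementation. It assumes that trope-wise querying means issuing one yes/no question per candidate trope instead of asking for all tropes at once, and that F1 is computed over the predicted and gold trope sets for a synopsis. The names `build_prompt`, `tropewise_predict`, `f1_score`, the prompt wording, and the `ask` callable (a stand-in for an LLM call) are all hypothetical.

```python
from typing import Callable, Iterable, Set


def build_prompt(synopsis: str, trope: str) -> str:
    """Build one binary question for a single trope (wording is an assumption)."""
    return (
        f"Synopsis:\n{synopsis}\n\n"
        f'Does the trope "{trope}" appear in this synopsis? Answer yes or no.'
    )


def tropewise_predict(
    synopsis: str,
    tropes: Iterable[str],
    ask: Callable[[str], str],
) -> Set[str]:
    """Query each trope separately and collect those answered with 'yes'.

    `ask` is a hypothetical stand-in for a model call: prompt in, text out.
    """
    predicted = set()
    for trope in tropes:
        answer = ask(build_prompt(synopsis, trope))
        if answer.strip().lower().startswith("yes"):
            predicted.add(trope)
    return predicted


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between a predicted and a gold set of tropes for one synopsis."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy usage with a fake "model" that only recognizes one trope.
# The trope names below are illustrative, not taken from the dataset.
def toy_ask(prompt: str) -> str:
    return "Yes" if '"Heroic Sacrifice"' in prompt else "No"


candidates = ["Heroic Sacrifice", "Big Bad", "Red Herring"]
gold_tropes = {"Heroic Sacrifice", "Big Bad"}
predicted_tropes = tropewise_predict(
    "A hero gives his life to save the city.", candidates, toy_ask
)
score = f1_score(predicted_tropes, gold_tropes)
```

In this sketch each trope gets its own independent binary decision, so one wrong answer does not affect the others; how the paper aggregates scores across synopses and tropes is not specified here.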

Paper Structure (37 sections, 6 figures, 9 tables)

Introduction
Related Work
Large Language Models (LLMs)
LLM Reasoning
Tropes
Narrative Reasoning with TiMoS
Experimental Setup
Task.
Trope-wise Querying.
Prompting.
Fine-Tuning.
Large Language Models.
LLMs Struggle Reasoning TiMoS
Trope-wise Querying Improves LLMs
Challenges of Chain-of-Thoughts (CoT)
...and 22 more sections

Figures (6)

Figure 1: While LLMs have revolutionized NLP reasoning, surpassing previous supervised learning (SL) methods and even reaching human-level performance on some tasks, their limitations become apparent when tested against the Trope dataset. NLU: Natural Language Understanding, CS: Commonsense. See Sections \ref{sec:1} and \ref{subsec:2:llmreason} for details.
Figure 2: Trope in Movie Synopses (TiMoS) requires the abstraction of narrative reasoning beyond physical presentation. For example, themes of justice (red block) or sacrifice (blue block) extend beyond death. TiMoS also explores connections between seemingly unrelated ideas, such as Batman's departure and his efforts to save the city (blue block).
Figure 3: The distribution of each trope in forecasting a "yes" outcome varies across the five binary classification results within the subset. See Appendix for more results.
Figure 4: F1 score gaps between (1) left: GPT-4 and ChatGPT, (2) middle: ChatGPT + CoT and ChatGPT, and (3) right: Supervised state-of-the-art MulCom \cite{chang2021situationtimos}. In A vs. B comparisons, blue indicates that A outperforms B, red indicates that B outperforms A, and text size represents the gap size. (Section \ref{subsec:4:7:pertrope})
Figure 5: The relationship between accuracy and word length in the result of ChatGPT CoT trope-wise querying.
...and 1 more figure
