Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Hung-Ting Su, Ya-Ching Hsu, Xudong Lin, Xiang-Qian Shi, Yulei Niu, Han-Yuan Hsu, Hung-yi Lee, Winston H. Hsu

TL;DR

This study uses tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and introduces a trope-wise querying approach that boosts the F1 score by 11.8 points.

Abstract

Large language models (LLMs) equipped with chain-of-thought (CoT) prompting have shown significant multi-step reasoning capabilities on factual content such as mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study uses tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows that CoT can cause hallucinations in narrative content, reducing GPT-4's performance. We also introduce an Adversarial Injection method that embeds trope-related text tokens into movie synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.
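The trope-wise querying idea mentioned in the abstract can be illustrated with a minimal sketch: instead of asking a model to list every trope in one pass, it is asked one yes/no question per trope, and the collected answers are scored with F1 against the gold labels. The prompt wording, the function names (`trope_wise_query`, `f1_score`, `keyword_stub`), the trope names, and the stub model are all assumptions made for illustration. They are not the authors' prompts, code, or data.

```python
from typing import Callable, List, Set


def trope_wise_query(
    synopsis: str, tropes: List[str], ask: Callable[[str], str]
) -> Set[str]:
    """Ask one binary question per trope; return the tropes answered 'yes'.

    `ask` is any callable mapping a prompt string to a model reply.
    The prompt template below is hypothetical, not taken from the paper.
    """
    predicted: Set[str] = set()
    for trope in tropes:
        prompt = (
            f"Synopsis: {synopsis}\n"
            f"Does the trope '{trope}' appear in this synopsis? "
            "Answer yes or no."
        )
        reply = ask(prompt).strip().lower()
        if reply.startswith("yes"):
            predicted.add(trope)
    return predicted


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """F1 over sets of trope labels for a single synopsis."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def keyword_stub(prompt: str) -> str:
    """Toy stand-in for an LLM call: says yes only for one trope name."""
    return "Yes" if "'Heroic Sacrifice'" in prompt else "No"


if __name__ == "__main__":
    # Hypothetical synopsis, candidate tropes, and gold labels.
    synopsis = "The hero gives up his life to stop the villain."
    candidates = ["Heroic Sacrifice", "Big Bad", "Downer Ending"]
    gold = {"Heroic Sacrifice", "Big Bad"}

    predicted = trope_wise_query(synopsis, candidates, keyword_stub)
    print(sorted(predicted), round(f1_score(predicted, gold), 3))
```

In practice `keyword_stub` would be replaced by a call to an actual model. Keeping `ask` as a plain callable makes it possible to compare different prompting strategies on the same set of per-trope questions.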

Paper Structure (37 sections, 6 figures, 9 tables)

Figures (6)

  • Figure 1: While LLMs have revolutionized NLP reasoning, surpassing previous supervised learning (SL) methods and even reaching human-level performance on some tasks, their limitations become apparent when tested against the Trope dataset. NLU: Natural Language Understanding, CS: Commonsense. See Sections \ref{sec:1} and \ref{subsec:2:llmreason} for details.
  • Figure 2: Trope in Movie Synopses (TiMoS) requires the abstraction of narrative reasoning beyond physical presentation. For example, themes of justice (red block) or sacrifice (blue block) extend beyond death. TiMoS also explores connections between seemingly unrelated ideas, such as Batman's departure and his efforts to save the city (blue block).
  • Figure 3: The distribution of each trope in forecasting a "yes" outcome varies across the five binary classification results within the subset. See Appendix for more results.
  • Figure 4: F1 score gaps between (1) left: GPT-4 and ChatGPT, (2) middle: ChatGPT + CoT and ChatGPT, and (3) right: supervised state-of-the-art MulCom [chang2021situationtimos]. In A vs. B comparisons, blue indicates that A outperforms B, red indicates that B outperforms A, and text size represents the gap size (Section \ref{subsec:4:7:pertrope}).
  • Figure 5: The relationship between accuracy and word length in the results of ChatGPT CoT trope-wise querying.
  • ...and 1 more figures