Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Hung-Ting Su; Chun-Tong Chao; Ya-Ching Hsu; Xudong Lin; Yulei Niu; Hung-Yi Lee; Winston H. Hsu

TL;DR

This work introduces TiM (Tropes in Movies), a trope-based dataset designed to stress-test LLM-based video reasoning along two axes: Abstract Perception and Long-range Compositional Reasoning in long videos. Benchmarks of state-of-the-art approaches (Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, Visual Programming, and Gemini 1.5) show a substantial gap to human performance: the best reported result is around $F1=40$, versus $F1=65$ for humans. To narrow this gap, the authors propose FEVoRI (Face-Enhanced Viper of Role Interactions), which makes Visual Programming aware of role interactions through face-based cues, and ConQueR (Context Query Reduction), which progressively refines movie contexts and trope queries during reasoning; together these yield about a $15$-point $F1$ improvement, though human parity remains out of reach. They also introduce ABCD, an AST-based protocol that quantifies how much Abstract Perception and Long-range Compositional Reasoning appear in VP-generated code, and it indicates that TiM places higher demands on abstraction and long-range integration than earlier datasets. Overall, TiM provides a testbed for advancing multimodal reasoning over long-form video, and the released dataset and code support further research in this direction.

Abstract

Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role interaction awareness and progressively refining movie contexts and trope queries during reasoning processes, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate the necessity of Abstract Perception and Long-range Compositional Reasoning for task resolution. This is done by analyzing the code generated through Visual Programming using an Abstract Syntax Tree (AST), thereby confirming the increased complexity of TiM. The dataset and code are available at: https://ander1119.github.io/TiM
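The abstract describes measuring task complexity by parsing Visual Programming output into an Abstract Syntax Tree. The following is a minimal sketch of that general idea using Python's standard `ast` module; it is not the authors' protocol. The sample program, the method names `exists` and `simple_query`, and the three statistics reported (call count, loop count, tree depth) are illustrative assumptions only.

```python
import ast

# Hypothetical VP-style program; the method names are assumptions for illustration.
SAMPLE = '''
def execute_command(video):
    for frame in video:
        if frame.exists("person"):
            return frame.simple_query("What is the person doing?")
    return "unknown"
'''


def tree_depth(node: ast.AST, depth: int = 0) -> int:
    """Return the depth of the deepest descendant of `node`."""
    children = list(ast.iter_child_nodes(node))
    return max((tree_depth(child, depth + 1) for child in children), default=depth)


def call_stats(source: str) -> dict:
    """Collect simple structural statistics from generated code."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        # Number of function or method calls (a rough proxy for perception queries).
        "calls": sum(isinstance(n, ast.Call) for n in nodes),
        # Number of loops (a rough proxy for iterating over many frames).
        "loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        # Maximum nesting depth of the syntax tree.
        "depth": tree_depth(tree),
    }


if __name__ == "__main__":
    print(call_stats(SAMPLE))
```

Comparing such statistics across programs generated for different datasets is one plausible way to contrast their structural complexity; the categories actually used in the paper's protocol may differ.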

Paper Structure (40 sections, 2 figures, 5 tables)

Introduction
Related Work
Comparison to Existing Tasks
Tropes in Movies
Tropes in Movies (TiM) Dataset
Overview
Trope
Task Definition
Evaluation
Data Collection
Data Statistics
Experiments
Baselines
Captioner-Reasoner
Large Multimodal Model Instruction Fine-tuning
...and 25 more sections

Figures (2)

Figure 1: Compared to previous datasets such as NExT-QA (Xiao et al., 2021), Tropes in Movies (TiM) introduces the challenges of Abstract Perception (upper box) and Long-range Compositional Reasoning (lower box), offering a robust framework for evaluating and developing LLM-based methods. The blue text (action) indicates that the answer to the action query affects the input of the judgment query and the causal query; decomposing these complex elements therefore requires multiple nested, interdependent queries.
Figure 2: Word cloud of trope occurrences in Fullset. The size of each trope is proportional to its frequency in Fullset, and its color corresponds to the category it belongs to.
