Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM

Yuanjie Lyu; Tong Xu; Zihan Niu; Bo Peng; Jing Ke; Enhong Chen

TL;DR

This work proposes a Two-Stage Prefix-Enhanced MLLM (TSPE) approach for event attribution, i.e., connecting associated events with their causal semantics, in movie videos, and demonstrates that this framework outperforms state-of-the-art methods.

Abstract

The growth of social media platforms has created an urgent demand for semantic-rich services, e.g., event and storyline attribution. However, most existing research focuses on clip-level event understanding, primarily through basic captioning tasks, without analyzing the causes of events across an entire movie. This is a significant challenge, as even advanced multimodal large language models (MLLMs) struggle with extensive multimodal information due to limited context length. To address this issue, we propose a Two-Stage Prefix-Enhanced MLLM (TSPE) approach for event attribution, i.e., connecting associated events with their causal semantics, in movie videos. In the local stage, we introduce an interaction-aware prefix that guides the model to focus on the relevant multimodal information within a single clip and to briefly summarize the single event. Correspondingly, in the global stage, we strengthen the connections between associated events using an inferential knowledge graph, and design an event-aware prefix that directs the model to focus on associated events rather than all preceding clips, resulting in accurate event attribution. Comprehensive evaluations on two real-world datasets demonstrate that our framework outperforms state-of-the-art methods.

Paper Structure (14 sections, 8 equations, 6 figures, 7 tables)

Introduction
Related Work
Video Semantic Understanding
Vision-Language Pre-trained Model
Method
Local Stage
Global Stage
Experiment
Datasets and Evaluation Metrics
Experiment Setup
Comparison with SOTA Methods on Automatic Metrics
Ablation Study
Visualization
Conclusion

Figures (6)

Figure 1: A toy example selected from the MovieGraph dataset, which contains two clips from the movie Forrest Gump. Specifically, Clip 1 contains three events, and Clip 2 contains two events. The reason behind the event "Gump follows Dan Taylor on a mission" in Clip 2 is derived from the event "Gump salutes Dan Taylor" in Clip 1.
Figure 2: Illustration of the Two-Stage Prefix-Enhanced MLLM (TSPE) method for event attribution in movie videos: Stage 1 extracts multimodal cues to briefly summarize the event, while Stage 2 uses these descriptions to infer underlying event causes. During training, only the last three layers of the LLM and the prefixes are fine-tuned.
Figure 3: In the local stage, we compress the textual and visual information from video frames and subtitles into event-related embeddings. An attention mechanism then measures the semantic relevance between the content and interactions, creating an interaction-aware prefix. This prefix is used as input to the LLM, which is fine-tuned to summarize the event in a single clip.
Figure 4: In the global stage, the results from the local stage are fed into a commonsense knowledge graph, ATOMIC, to predict potential outcomes of prior events. An attention mechanism assesses the relevance between the current and previous events, fusing them into event-aware embeddings. These embeddings are then input into the LLM, which is fine-tuned to generate the underlying causes of the events.
Figure 5: Local stage cases. A short video clip consists of synchronized frames and dialogues, together with the interactions between characters. Text shown in gray is excluded when computing metrics such as BLEU: because the interaction text is already part of the input, the interaction part is removed from the generated text before evaluation.
...and 1 more figure
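
The captions for Figures 3 and 4 both describe an attention mechanism that scores how relevant each piece of content is to a guiding signal (the interaction in the local stage, the current event in the global stage) and fuses the result into a prefix. The following is a minimal sketch of that general idea using plain scaled dot-product attention. It is not the authors' implementation: the function names, the embedding size, and the toy vectors are all hypothetical, and the actual TSPE architecture may differ in every detail.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_prefix(query, contents):
    """Fuse content embeddings into one vector, weighted by relevance to the query.

    query:    embedding of the guiding signal (assumed here: an interaction
              or the current event), a list of floats
    contents: list of content embeddings (assumed here: frames, subtitles,
              or earlier events), each the same length as query
    """
    d = len(query)
    # Scaled dot-product relevance between the query and each content vector.
    scores = [
        sum(q * c for q, c in zip(query, vec)) / math.sqrt(d)
        for vec in contents
    ]
    weights = softmax(scores)
    # Weighted sum of the content vectors gives the fused prefix vector.
    prefix = [
        sum(w * vec[i] for w, vec in zip(weights, contents))
        for i in range(d)
    ]
    return weights, prefix


# Hypothetical toy embeddings: only the first content vector aligns with the query.
interaction = [1.0, 0.0, 0.0, 0.0]
contents = [
    [2.0, 0.0, 0.0, 0.0],  # content related to the interaction
    [0.0, 2.0, 0.0, 0.0],  # unrelated content
    [0.0, 0.0, 2.0, 0.0],  # unrelated content
]
weights, prefix = attention_prefix(interaction, contents)
```

In this toy case the first content vector receives the largest weight, so the fused vector is dominated by the content most relevant to the query. In a real system the resulting vector would typically be projected into the LLM's embedding space before being prepended to the input; that step is omitted here.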
