MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang

TL;DR

MovieTeller is a novel framework for generating movie synopses via tool-augmented progressive abstraction. It decomposes the summarization of a full-length movie into a multi-stage process, which mitigates the context length limitations of current VLMs.

Abstract

With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
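
The prompt-injection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Grounding` type, the `build_grounded_prompt` function, the prompt wording, and the character name are all hypothetical, and the face recognition tool is assumed to have already returned identities with pixel bounding boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grounding:
    """One tool result (hypothetical type): who was recognized, and where."""
    name: str
    bbox: tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) in pixels


def build_grounded_prompt(groundings: list[Grounding]) -> str:
    """Render tool outputs as text facts placed before the captioning instruction."""
    if groundings:
        facts = "\n".join(
            f"- {g.name} at bbox (x1={g.bbox[0]}, y1={g.bbox[1]}, "
            f"x2={g.bbox[2]}, y2={g.bbox[3]})"
            for g in groundings
        )
    else:
        # Fall back to an explicit statement so the VLM does not invent names.
        facts = "No known characters were recognized in this frame."
    return (
        "Known facts about this frame (from a face recognition tool):\n"
        f"{facts}\n"
        "Describe the scene. Refer to characters only by the names listed above."
    )


# Hypothetical example: one recognized character in a keyframe.
prompt = build_grounded_prompt([Grounding("Alice", (10, 20, 110, 220))])
print(prompt)
```

Placing the facts ahead of the instruction is one plausible design choice; the abstract states only that the groundings are injected into the prompt, not where.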

Paper Structure

This paper contains 22 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The advantage of MovieTeller over a general VLM. A general VLM merely describes the content of each frame and lacks ID consistency and narrative coherence. MovieTeller identifies characters accurately, which preserves the integrity and continuity of the movie synopsis.
  • Figure 2: The overall architecture of the proposed MovieTeller framework. The framework begins by processing a long-form video to extract high-quality keyframes through scene segmentation and a quality gate. A key innovation is the subsequent tool-augmented stage, where an expert tool provides factual groundings (character ID, bounding box) to a VLM, ensuring ID-consistent scene descriptions. This information is then progressively abstracted, first into chapter summaries and then into the complete movie synopsis.
  • Figure 3: Dataset diversity statistics (100 movies). From left to right: distributions of release year, language, and genre.
  • Figure 4: Qualitative comparison of the final, full-movie synopses generated by each method. Color legend: Green for correct and detailed information, Orange for vague information, Red for incorrect information. MovieTeller produces a synopsis that is superior in both factual detail and narrative depth.
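
The progressive abstraction shown in Figure 2 (scene descriptions, then chapter summaries, then one synopsis) can be sketched as a two-level reduction. This is an illustrative sketch under stated assumptions, not the paper's pipeline: `progressive_abstraction`, `chunk`, the fixed `scenes_per_chapter` grouping, and the prompt strings are hypothetical, and `summarize` stands in for any call to a language model.

```python
from typing import Callable

# Any text-in, text-out model call; a stub is used below for demonstration.
Summarizer = Callable[[str], str]


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def progressive_abstraction(
    scene_descriptions: list[str],
    summarize: Summarizer,
    scenes_per_chapter: int = 4,  # assumed grouping rule, not from the paper
) -> str:
    """Summarize scenes into chapters, then chapters into one synopsis."""
    chapters = [
        summarize("Summarize these scenes as one chapter:\n" + "\n".join(group))
        for group in chunk(scene_descriptions, scenes_per_chapter)
    ]
    # Each model call sees only a bounded slice of text, never the whole movie.
    return summarize(
        "Combine these chapter summaries into a synopsis:\n" + "\n".join(chapters)
    )


# Stub summarizer: records each call and reports how many items it was given.
calls: list[str] = []


def stub_summarize(prompt: str) -> str:
    calls.append(prompt)
    return f"<{len(prompt.splitlines()) - 1} items>"


scenes = [f"Scene {i}" for i in range(10)]
synopsis = progressive_abstraction(scenes, stub_summarize)
```

With ten scenes and four scenes per chapter, the stub is called three times for chapters and once for the final synopsis, so no single call receives more than four scene descriptions.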