Towards Long Video Understanding via Fine-detailed Video Story Generation

Zeng You, Zhiquan Wen, Yaofo Chen, Xin Li, Runhao Zeng, Yaowei Wang, Mingkui Tan

TL;DR

FDVS tackles the challenge of long-video understanding by converting lengthy footage into hierarchical textual representations through a Bottom-Up Video Interpretation Mechanism and a Semantic Redundancy Reduction strategy. By leveraging three-level perception models (object, action, caption) and Large Language Models to generate clip-level chapters and an overall video story, FDVS enables multi-granularity understanding without task-specific fine-tuning. The approach demonstrates strong zero-shot performance across eight datasets and three tasks, highlighting its versatility for retrieval and QA while reducing storage and maintaining efficiency for downstream use. These results suggest that structured textual representations can effectively capture complex long-form video semantics, offering practical impact for scalable video understanding and retrieval.

Abstract

Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing video understanding methods suffer from two challenges when dealing with long video understanding: intricate long-context relationship modeling and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long videos into hierarchical textual representations that contain multi-granularity information of the video. With these representations, FDVS is applicable to various tasks without any fine-tuning. We evaluate the proposed method across eight datasets spanning three tasks. The performance demonstrates the effectiveness and versatility of our method.
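The bottom-up flow described above (frames to clip chapters to a video story, with redundancy removed at both the visual and textual levels) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`drop_redundant`, `video_story`), the use of cosine similarity against the last kept item, and the threshold values are all assumptions, and the embedding, clip-description, and summarization callables stand in for the perception models and LLM.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def drop_redundant(
    items: Sequence[T],
    embed: Callable[[T], Vector],
    threshold: float = 0.9,  # assumed value, not taken from the paper
) -> List[T]:
    """Keep an item only if it differs enough from the last kept item."""
    kept: List[T] = []
    last_vec = None
    for item in items:
        vec = embed(item)
        if last_vec is None or cosine(vec, last_vec) < threshold:
            kept.append(item)
            last_vec = vec
    return kept


def video_story(
    clips: Sequence[Sequence[T]],
    embed_frame: Callable[[T], Vector],
    embed_text: Callable[[str], Vector],
    describe_clip: Callable[[List[T]], str],   # stand-in for perception + LLM
    summarize: Callable[[List[str]], str],     # stand-in for the final LLM call
) -> Tuple[List[str], str]:
    """Return (chapters, story): the hierarchical textual representation."""
    chapters: List[str] = []
    for frames in clips:
        # Visual-level reduction: drop near-duplicate frames inside a clip.
        frames_kept = drop_redundant(frames, embed_frame)
        chapters.append(describe_clip(frames_kept))
    # Textual-level reduction: drop chapters that repeat the previous one.
    chapters = drop_redundant(chapters, embed_text)
    return chapters, summarize(chapters)
```

Because the models are passed in as callables, the same skeleton can be exercised with toy functions before plugging in real perception models or an LLM.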

Paper Structure

This paper contains 24 sections, 4 equations, 6 figures, 19 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of video understanding with LLMs. To allow LLMs without visual perception to understand the video content, we provide perception information from three levels, i.e., object level, temporal level, and scene level.
  • Figure 2: General pipeline of our method. We extract a compact hierarchical textual representation rather than deep features for downstream video understanding tasks. Given any video ${\boldsymbol{V}}$, we first segment and sample it into clips based on keyframes. Redundant frames within each clip are removed using a Visual Redundancy Reduction strategy. Subsequently, we employ three perception foundation models to extract visual information. An LLM describes the clip content using the perception information. Redundant clips are removed via Textual Redundancy Reduction. Finally, an LLM summarizes the video story from the remaining chapters.
  • Figure 3: Illustration of information extraction via three-level agents and prompt organization via a predefined template. We leverage well-trained vision models as perception agents to comprehensively extract visual information from frames. Then, we leverage LLMs to interpret each clip's content based on the perception information.
  • Figure 4: Comparison against image caption methods on MSRVTT over the zero-shot text-to-video retrieval task.
  • Figure 5: Qualitative results of our method and VideoLLaVa. The videos are from ActivityNet Captions.
  • ...and 1 more figure
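The prompt organization mentioned in the Figure 3 caption (three levels of perception output placed into a predefined template for the LLM) might look something like the sketch below. The template wording, the `FramePerception` fields, and `build_clip_prompt` are hypothetical and only illustrate the idea; the paper's actual template is not reproduced in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FramePerception:
    """Perception output for one frame (field names are illustrative)."""
    objects: List[str]  # object level
    action: str         # temporal level
    caption: str        # scene level


# Hypothetical template; the wording is an assumption, not the paper's prompt.
CLIP_PROMPT = (
    "You cannot see the video. Below is what perception models report for "
    "each sampled frame of one clip.\n"
    "{frames}\n"
    "Describe what happens in this clip in a few sentences."
)


def build_clip_prompt(frames: List[FramePerception]) -> str:
    """Fill the template with one line of perception text per frame."""
    lines = []
    for i, f in enumerate(frames, start=1):
        objs = ", ".join(f.objects) if f.objects else "none detected"
        lines.append(
            f"Frame {i}: objects: {objs}; action: {f.action}; caption: {f.caption}"
        )
    return CLIP_PROMPT.format(frames="\n".join(lines))
```

The resulting string would be sent to an LLM to obtain one chapter per clip; keeping the template separate from the filling logic makes it easy to swap in a different wording.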