Towards Long Video Understanding via Fine-detailed Video Story Generation
Zeng You, Zhiquan Wen, Yaofo Chen, Xin Li, Runhao Zeng, Yaowei Wang, Mingkui Tan
TL;DR
FDVS (Fine-Detailed Video Story generation) tackles long-video understanding by converting lengthy footage into hierarchical textual representations, using a Bottom-Up Video Interpretation Mechanism and a Semantic Redundancy Reduction strategy. It combines perception models at three levels (object, action, caption) with Large Language Models to generate clip-level chapters and an overall video story, which enables multi-granularity understanding without task-specific fine-tuning. The approach shows strong zero-shot performance across eight datasets and three tasks, highlighting its versatility for retrieval and QA, while reducing storage and remaining efficient for downstream use. These results suggest that structured textual representations can capture complex long-form video semantics and support scalable video understanding and retrieval.
Abstract
Long video understanding has become a critical task in computer vision, driving advances in applications ranging from surveillance to content retrieval. Existing video understanding methods face two challenges on long videos: modeling intricate long-context relationships and interference from redundancy. To tackle these challenges, we introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations. Specifically, to achieve fine-grained modeling of long-temporal content, we propose a Bottom-up Video Interpretation Mechanism that progressively interprets video content from clips to the whole video. To avoid interference from redundant information in videos, we introduce a Semantic Redundancy Reduction mechanism that removes redundancy at both the visual and textual levels. Our method transforms long videos into hierarchical textual representations that contain multi-granularity information about the video. With these representations, FDVS is applicable to various tasks without any fine-tuning. We evaluate the proposed method on eight datasets spanning three tasks. The results demonstrate the effectiveness and versatility of our method.
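
To make the clip-to-video flow described above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the FDVS implementation: the abstract does not specify how redundancy is measured or how the LLM is prompted, so the cosine-similarity test, the 0.9 threshold, the exact-match text deduplication, and all function names (`keep_non_redundant`, `dedupe_text`, `build_story`, `summarize`) are hypothetical stand-ins.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_non_redundant(features: Sequence[Sequence[float]],
                       threshold: float = 0.9) -> List[int]:
    """Visual-level reduction (assumed rule): keep a clip only if it is not
    too similar to the most recently kept clip. Returns kept clip indices."""
    kept: List[int] = []
    for i, feat in enumerate(features):
        if not kept or cosine(features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


def dedupe_text(items: Sequence[str]) -> List[str]:
    """Textual-level reduction (assumed rule): drop case-insensitive exact
    repeats while preserving the original order."""
    seen = set()
    unique: List[str] = []
    for item in items:
        key = item.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(item.strip())
    return unique


def build_story(clip_descriptions: Sequence[Sequence[str]],
                summarize: Callable[[List[str]], str]) -> Tuple[List[str], str]:
    """Bottom-up interpretation: per-clip perception outputs -> chapters -> story.
    `summarize` is a placeholder for an LLM call supplied by the caller."""
    chapters = [summarize(dedupe_text(desc)) for desc in clip_descriptions]
    story = summarize(chapters)
    return chapters, story


if __name__ == "__main__":
    # Toy 2-D clip features; real features would come from a visual encoder.
    feats = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 1.0]]
    kept = keep_non_redundant(feats)

    # Toy perception outputs for each original clip.
    outputs = [["a dog", "a dog", "running"], ["a dog"], ["a ball"], ["a ball"]]
    chapters, story = build_story([outputs[i] for i in kept],
                                  summarize=lambda parts: "; ".join(parts))
    print(kept, chapters, story)
```

Passing the summarizer in as a callable keeps the sketch independent of any particular LLM API; a real system would replace the `"; ".join` stand-in with a model call and would likely use a learned similarity measure rather than exact string matching.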
