Agent-based Video Trimming
Lingfeng Yang, Zhenyuan Chen, Xiang Li, Peiyang Jia, Liangqu Long, Jian Yang
TL;DR
Video Trimming (VT) is a new task for long-form video: detect wasted footage, select valuable segments, and arrange them into a coherent narrative. The paper proposes Agent-based Video Trimming (AVT), a three-phase, agent-driven framework that converts video slices into structured text, filters clips with a dynamic defect–highlight mechanism, and composes a final narrative with a Video Arrangement Agent. A separate Video Evaluation Agent assesses the trimmed videos alongside human evaluation. The authors introduce a new trimming benchmark dataset and report that AVT receives more favorable ratings in user studies and achieves superior mAP and precision for zero-shot highlight detection on YouTube Highlights, TVSum, and their own dataset. This work advances practical, story-driven video summarization for long videos and provides a benchmark for future research.
Abstract
As information becomes more accessible, user-generated videos are increasing in length, placing a burden on viewers to sift through vast content for valuable insights. This trend underscores the need for an algorithm to extract key video information efficiently. Despite significant advancements in highlight detection, moment retrieval, and video summarization, current approaches primarily focus on selecting specific time intervals, often overlooking the relevance between segments and the potential for segment arranging. In this paper, we introduce a novel task called Video Trimming (VT), which focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. To address this task, we propose Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, we employ a Video Captioning Agent to convert video slices into structured textual descriptions, a Filtering Module to dynamically discard low-quality footage based on the structured information of each clip, and a Video Arrangement Agent to select and compile valid clips into a coherent final narrative. For evaluation, we develop a Video Evaluation Agent to assess trimmed videos, conducting assessments in parallel with human evaluations. Additionally, we curate a new benchmark dataset for video trimming using raw user videos from the internet. As a result, AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task. The code and models are available at https://ylingfeng.github.io/AVT.
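The three phases described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Clip` fields, the score values, and the function names are hypothetical. It assumes that Video Structuring has already produced a caption, a highlight score, and per-defect scores for each clip, that filtering keeps a clip only when its highlight score exceeds its worst defect score, and that a simple "top clips in time order" rule stands in for the Video Arrangement Agent.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    """Structured description of one video slice (all fields are hypothetical)."""
    start: float      # clip start time in seconds
    end: float        # clip end time in seconds
    caption: str      # text produced by the captioning step
    highlight: float  # assumed highlight score in [0, 1]
    defects: dict = field(default_factory=dict)  # assumed defect name -> score


def filter_clips(clips):
    """Keep a clip only when its highlight score exceeds its worst defect score."""
    kept = []
    for clip in clips:
        worst_defect = max(clip.defects.values(), default=0.0)
        if clip.highlight > worst_defect:
            kept.append(clip)
    return kept


def compose_story(clips, max_clips):
    """Stand-in for the arrangement agent: pick the top clips, play them in time order."""
    best = sorted(clips, key=lambda c: c.highlight, reverse=True)[:max_clips]
    return sorted(best, key=lambda c: c.start)


def trim(clips, max_clips=2):
    """Run filtering, then composition, on already-structured clips."""
    return compose_story(filter_clips(clips), max_clips)


# Toy input: scores are made up for illustration only.
clips = [
    Clip(0.0, 3.0, "camera shakes while walking", 0.2, {"shake": 0.8}),
    Clip(3.0, 6.0, "skier jumps off a ramp", 0.9, {"shake": 0.4}),
    Clip(6.0, 9.0, "lens covered by a glove", 0.1, {"occlusion": 0.9}),
    Clip(9.0, 12.0, "group celebrates at the finish", 0.7, {}),
    Clip(12.0, 15.0, "wide shot of the slope", 0.5, {"blur": 0.3}),
]
final = trim(clips)
print([c.caption for c in final])
```

In this toy run the shaky and occluded clips are discarded because a defect score outweighs their highlight score, and the two strongest remaining clips are returned in chronological order. In AVT itself, arrangement is performed by an agent reasoning over the structured text, so the sorting rule here is only a placeholder.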
