Agent-based Video Trimming

Lingfeng Yang, Zhenyuan Chen, Xiang Li, Peiyang Jia, Liangqu Long, Jian Yang

TL;DR

Video Trimming (VT) is a new task for long-form video that detects wasted footage, selects valuable segments, and arranges them into a coherent narrative. The paper proposes Agent-based Video Trimming (AVT), a three-phase, agent-driven framework that converts video slices into structured text, filters clips with a dynamic defect–highlight mechanism, and composes the final narrative with a Video Arrangement Agent. A separate Video Evaluation Agent assesses the trimmed videos in parallel with human evaluation. The authors also introduce a new video trimming dataset as a benchmark for future research, and report that AVT is rated more favorably in user studies and achieves higher mAP and precision for zero-shot highlight detection on YouTube Highlights, TVSum, and their own dataset.

Abstract

As information becomes more accessible, user-generated videos are increasing in length, placing a burden on viewers to sift through vast content for valuable insights. This trend underscores the need for an algorithm to extract key video information efficiently. Despite significant advancements in highlight detection, moment retrieval, and video summarization, current approaches primarily focus on selecting specific time intervals, often overlooking the relevance between segments and the potential for segment arranging. In this paper, we introduce a novel task called Video Trimming (VT), which focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. To address this task, we propose Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, we employ a Video Captioning Agent to convert video slices into structured textual descriptions, a Filtering Module to dynamically discard low-quality footage based on the structured information of each clip, and a Video Arrangement Agent to select and compile valid clips into a coherent final narrative. For evaluation, we develop a Video Evaluation Agent to assess trimmed videos, conducting assessments in parallel with human evaluations. Additionally, we curate a new benchmark dataset for video trimming using raw user videos from the internet. As a result, AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task. The code and models are available at https://ylingfeng.github.io/AVT.
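
The three phases named in the abstract can be pictured as a simple function pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Clip` record, the `trim` function, and the three `toy_*` stand-ins are hypothetical names introduced for illustration, and plain Python callables replace the captioning and arrangement agents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Clip:
    index: int     # position of the slice in the raw video
    caption: str   # text description of the slice
    # Hypothetical structured flags attached during Video Structuring.
    attributes: Dict[str, bool] = field(default_factory=dict)


def trim(
    slices: List[str],
    describe: Callable[[int, str], Clip],
    keep: Callable[[Clip], bool],
    arrange: Callable[[List[Clip]], List[int]],
) -> List[int]:
    """Run the three phases in order and return the chosen clip indices."""
    # Phase 1, Video Structuring: turn each slice into a structured record.
    clips = [describe(i, s) for i, s in enumerate(slices)]
    # Phase 2, Clip Filtering: discard clips judged to be wasted footage.
    valid = [c for c in clips if keep(c)]
    # Phase 3, Story Composition: select and order the remaining clips.
    return arrange(valid)


# Toy stand-ins (assumptions): a real system would call models here.
def toy_describe(i: int, s: str) -> Clip:
    return Clip(index=i, caption=s, attributes={"defective": "blurry" in s})


def toy_keep(c: Clip) -> bool:
    return not c.attributes["defective"]


def toy_arrange(clips: List[Clip]) -> List[int]:
    # Keeps the original order; an arrangement agent could reorder clips.
    return [c.index for c in clips]


selected = trim(
    ["rider starts downhill", "blurry ground shot", "rider lands a jump"],
    toy_describe,
    toy_keep,
    toy_arrange,
)
print(selected)  # [0, 2]
```

Passing the phases in as callables keeps the sketch independent of any particular captioning model or LLM. Only the order of the phases is taken from the abstract.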

Paper Structure

This paper contains 26 sections, 2 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: A comparison between our new task and existing video tasks: (a) Highlight Detection retrieves clips above a saliency threshold. (b) Moment Retrieval identifies the start and end for intervals related to a given query. (c) Video Summarization extracts keyframes for each theme of the video. (d) Video Trimming addresses more than just a retrieval task by also filtering wasted footage and logically composing the selected segments.
  • Figure 2: The overall framework of AVT. The approach first (a) converts sampled video content into structured captions and attributes, then (b) discards defective clips, and finally (c) organizes the remaining clips into a coherent final cut.
  • Figure 3: Keyframes from a mountain biking video. Clips marked with red boxes are discarded due to higher defect scores, while clips with green boxes are selected despite minor shaking, as they highlight the dynamic scene of cycling on a mountain path.
  • Figure 4: The story composition phase. The task introduction is first given to the LLM to generate a chain of thought (CoT) describing the composition steps. After minor adjustments, the Video Arrangement Agent is called with the refined user input to sequentially select clips and arrange the story. The output consists of the selected clip indices and the segmented story content.
  • Figure 5: Highlight detection results of mAP and precision on our collected video trimming dataset.
  • ...and 10 more figures
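
The Figure 3 caption describes clips that are kept despite minor shaking because they show a highlight, while clips with higher defect scores are dropped. One way to picture such a dynamic rule is sketched below. The rule itself (keep a clip when its highlight score is at least its worst defect score), the function names, and the 0-to-1 scores are assumptions made for illustration and are not taken from the paper. A fixed-threshold filter is included only for contrast.

```python
from typing import Dict


def keep_dynamic(highlight: float, defects: Dict[str, float]) -> bool:
    """Keep a clip unless some defect score exceeds its highlight score."""
    return highlight >= max(defects.values(), default=0.0)


def keep_static(defects: Dict[str, float], threshold: float) -> bool:
    """Contrast case: keep a clip only if its worst defect is within a fixed limit."""
    return max(defects.values(), default=0.0) <= threshold


# Hypothetical scores: a slightly shaky but exciting cycling clip.
print(keep_dynamic(0.8, {"shaking": 0.4}))  # True
print(keep_static({"shaking": 0.4}, 0.3))   # False

# Hypothetical scores: a dull clip with a heavily occluded view.
print(keep_dynamic(0.1, {"shaking": 0.4, "occlusion": 0.7}))  # False
```

Under these assumed scores, the dynamic rule keeps the cycling clip that a fixed defect threshold of 0.3 would discard, which is the behavior the caption illustrates.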