From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu
TL;DR
The paper addresses the challenge of condensing long-form videos into concise, engaging clips, using multimodal narrative understanding to overcome the limitations of ASR-centric editing. It introduces HIVE, a framework that combines character extraction, dialogue analysis, and narrative summarization via multimodal LLMs with scene-level segmentation and a three-subtask editing workflow: highlight detection, opening/ending selection, and pruning of irrelevant content. The authors also contribute DramaAD, a benchmark dataset, along with evaluation metrics spanning diversity, smoothness, engagement, and ad-specific hooks. Empirical results show that HIVE outperforms existing baselines on both general and advertisement-oriented editing tasks, narrowing the quality gap to human editing. The work targets practical automatic editing for both general content and advertising, and provides a new research resource for multimodal narrative understanding in video editing.
Abstract
The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.
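To make the three-subtask decomposition concrete, the sketch below shows one way scene-level decisions could be combined into a clip. It is an illustration under stated assumptions, not the HIVE implementation: the `Scene` fields, the per-scene scores (which in the paper's setting would come from a multimodal LLM), the `min_relevance` threshold, the duration budget, and the greedy fill are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    index: int        # chronological position in the source video
    duration: float   # length in seconds
    highlight: float  # hypothetical highlight score in [0, 1]
    opening: float    # hypothetical suitability as the clip's opening
    ending: float     # hypothetical suitability as the clip's ending
    relevance: float  # hypothetical relevance to the main storyline


def edit(scenes, max_duration, min_relevance=0.5):
    """Pick scenes for a clip via three separate sub-decisions (illustrative only)."""
    # Pruning: drop scenes judged irrelevant to the storyline.
    kept = [s for s in scenes if s.relevance >= min_relevance]
    if not kept:
        return []

    # Opening/ending selection: best opening, then best ending that comes
    # after it (assumption: the clip preserves chronological order).
    opening = max(kept, key=lambda s: s.opening)
    later = [s for s in kept if s.index > opening.index]
    ending = max(later, key=lambda s: s.ending) if later else None

    chosen = [opening] + ([ending] if ending else [])
    used = sum(s.duration for s in chosen)

    # Highlight detection: greedily fill the remaining duration budget with
    # the highest-scoring scenes that lie between the opening and the ending.
    middle = [s for s in kept if ending and opening.index < s.index < ending.index]
    for s in sorted(middle, key=lambda s: s.highlight, reverse=True):
        if used + s.duration <= max_duration:
            chosen.append(s)
            used += s.duration

    # Restore chronological order for the final clip.
    return sorted(chosen, key=lambda s: s.index)
```

Operating on whole scenes rather than individual transcript segments means each selected unit is already self-contained, which is the intuition behind the coherence benefit the abstract attributes to scene-level segmentation.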
