Text-Video Multi-Grained Integration for Video Moment Montage
Zhihui Yin, Ye Ma, Xipeng Cao, Bo Wang, Quan Chen, Peng Jiang
TL;DR
The paper addresses the labor-intensive task of assembling short video montages by introducing Video Moment Montage (VMM), in which a narration script guides the extraction and ordering of segments from multiple candidate videos. It presents TV-MGI, a multi-grained transformer framework that fuses text with frame-level and shot-level video features through frame-shot-text cross-attention, achieving both fine-grained and global alignment between scripts and video content. To support research, it introduces the MSSD dataset, a large-scale text-video corpus with fine-grained alignment, multi-sentence scripts, and shot-level annotations, and reports superior performance over strong baselines in quantitative, qualitative, ablation, and user studies. The approach aims to streamline automatic montage generation, enabling scalable, coherent, narration-driven video editing, with potential impact on content creation workflows and multimedia AI research.
Abstract
The proliferation of online short video platforms has driven a surge in user demand for short video editing. However, manually selecting, cropping, and assembling raw footage into a coherent, high-quality video remains laborious and time-consuming. To accelerate this process, we focus on a new, user-friendly task called Video Moment Montage (VMM), which aims to accurately locate the video segments that correspond to a pre-provided narration text and then arrange these clips into a complete video that matches the narration. The challenge lies in extracting precise temporal segments while ensuring intra-sentence and inter-sentence context consistency, as a single script sentence may require trimming and assembling multiple video clips. To address this problem, we present a novel Text-Video Multi-Grained Integration method (TV-MGI) that efficiently fuses text features from the script with both shot-level and frame-level video features, enabling both global and fine-grained alignment between the video content and the corresponding textual descriptions in the script. To facilitate further research in this area, we introduce the Multiple Sentences with Shots Dataset (MSSD), a large-scale dataset designed explicitly for the VMM task. We conduct extensive experiments on the MSSD dataset to demonstrate the effectiveness of our framework compared to baseline methods.
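
To make the idea of fusing script text with frame-level and shot-level video features more concrete, the following is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the TV-MGI architecture: the function names (`cross_attention`, `fuse_text_with_video`), the choice to let text tokens attend over the concatenation of frame and shot features, and the toy 2-dimensional features are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Returns the attended outputs and the attention weights per query.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        out = [
            sum(wj * v[d] for wj, v in zip(w, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def fuse_text_with_video(text_feats, frame_feats, shot_feats):
    """Hypothetical fusion step (an assumption, not the paper's design):
    text tokens attend jointly over frame-level and shot-level features."""
    video_feats = frame_feats + shot_feats  # list concatenation
    return cross_attention(text_feats, video_feats, video_feats)


# Toy features: two text tokens, three frames, two shots (all made up).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
frame_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
shot_feats = [[0.95, 0.05], [0.0, 1.0]]

fused, weights = fuse_text_with_video(text_feats, frame_feats, shot_feats)
```

In this toy setup, each text token places more attention weight on the frames and shots whose features point in the same direction, which is the kind of text-to-video alignment signal that a montage model could use to pick segments for each script sentence.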
