Text-Video Multi-Grained Integration for Video Moment Montage

Zhihui Yin, Ye Ma, Xipeng Cao, Bo Wang, Quan Chen, Peng Jiang

TL;DR

The paper addresses the labor-intensive work of assembling short video montages by introducing Video Moment Montage (VMM), a task in which a narration script guides the extraction and ordering of segments from multiple candidate videos. It presents TV-MGI, a multi-grained transformer framework that fuses text with frame-level and shot-level video features through frame-shot-text cross-attention, giving both fine-grained and global alignment between the script and the video content. To support research on the task, it introduces MSSD, a large-scale text-video dataset with multi-sentence scripts, shot-level annotations, and fine-grained temporal alignment. The authors report that TV-MGI outperforms strong baselines in quantitative, qualitative, ablation, and user studies. The approach targets automatic, narration-driven montage generation, with potential impact on content creation workflows and multimedia AI research.

Abstract

The proliferation of online short video platforms has driven a surge in user demand for short video editing. However, manually selecting, cropping, and assembling raw footage into a coherent, high-quality video remains laborious and time-consuming. To accelerate this process, we focus on a user-friendly new task called Video Moment Montage (VMM), which aims to accurately locate the corresponding video segments based on a pre-provided narration text and then arrange these video clips to create a complete video that aligns with the corresponding descriptions. The challenge lies in extracting precise temporal segments while ensuring intra-sentence and inter-sentence context consistency, as a single script sentence may require trimming and assembling multiple video clips. To address this problem, we present a novel Text-Video Multi-Grained Integration method (TV-MGI) that efficiently fuses text features from the script with both shot-level and frame-level video features, which enables the global and fine-grained alignment between the video content and the corresponding textual descriptions in the script. To facilitate further research in this area, we introduce the Multiple Sentences with Shots Dataset (MSSD), a large-scale dataset designed explicitly for the VMM task. We conduct extensive experiments on the MSSD dataset to demonstrate the effectiveness of our framework compared to baseline methods.
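
To make the task's input and output concrete, the sketch below models a VMM result as a list of script sentences, each mapped to one or more trimmed clips, and lays those clips end to end in script order. It is only an illustration of the task format described in the abstract, not the TV-MGI model; the names `Segment`, `SentenceClips`, and `build_timeline` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """A trimmed clip (hypothetical structure, not from the paper)."""
    video_id: str  # which candidate video the clip is cut from
    start: float   # clip start inside the source video, in seconds
    end: float     # clip end inside the source video, in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class SentenceClips:
    """One script sentence and the clips selected for it."""
    sentence: str
    segments: List[Segment]  # a single sentence may need several clips


def build_timeline(
    montage: List[SentenceClips],
) -> List[Tuple[int, Segment, float, float]]:
    """Concatenate the selected clips in script order.

    Returns (sentence_index, segment, out_start, out_end) tuples, where
    out_start/out_end are positions on the output montage timeline.
    """
    timeline = []
    cursor = 0.0
    for idx, item in enumerate(montage):
        for seg in item.segments:
            if seg.end <= seg.start:
                raise ValueError(f"empty or reversed segment: {seg}")
            timeline.append((idx, seg, cursor, cursor + seg.duration))
            cursor += seg.duration
    return timeline
```

In this representation, intra-sentence consistency concerns the clips within one `segments` list, while inter-sentence consistency concerns the order and content of clips across consecutive `SentenceClips` entries.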

Paper Structure

This paper contains 21 sections, 8 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Differences between VMM and other tasks. (a) VMG involves creating video montages using complete videos. (b) VCMR utilizes individual sentences to retrieve specific moments within a video corpus but cannot form a coherent script-level video. (c) The key distinction of our proposed VMM task is that the video montage is composed of selected moments from the videos, and a single sentence can correspond to the composition of multiple video fragments, enabling more flexible video construction.
  • Figure 2: Architecture of our proposed TV-MGI. From left to right: the overall framework, the multi-grained fusion module, and the prediction head. We first employ visual and text encoders to map the input text and video into embeddings. Next, we apply attention fusion to the visual and textual features at both the shot level and the frame level. The output representations from the fusion encoder are used for prediction at both levels, enabling the generation of multiple video segments corresponding to each sentence in the script.
  • Figure 3: Left: statistics of the scripts and videos in our training dataset. Right: a sample from the dataset, including the script, the shots, and the temporal alignment between the visuals and the script timeline. More dataset details and cases are in the supplementary materials.
  • Figure 4: An example of short video montages generated by different methods. Four consecutive sentences from a single script and the corresponding video segments generated by each method are presented in separate rows. The orange dashed line represents the recall of correct segments. Our method performs better in terms of both semantic and temporal consistency.
  • Figure 5: User study results. For a given script, users ranked videos generated by four different methods, where rank 1 is best, and rank 4 is worst. We report the frequency of each rank received by each method and the average rank for each.
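
The average rank reported in Figure 5 can be derived from the per-rank frequencies as a frequency-weighted mean. The sketch below shows that arithmetic; the function name `average_rank` is hypothetical, and the counts in the usage note are made-up illustrative numbers, not results from the paper.

```python
from typing import Sequence


def average_rank(rank_counts: Sequence[int]) -> float:
    """Frequency-weighted mean rank.

    rank_counts[i] is how many times a method received rank i + 1,
    so a lower average means users preferred the method more often.
    """
    total = sum(rank_counts)
    if total == 0:
        raise ValueError("no rankings recorded")
    weighted = sum(rank * count for rank, count in enumerate(rank_counts, start=1))
    return weighted / total
```

For example, a method ranked first 5 times, second 3 times, third once, and fourth once (hypothetical counts) has an average rank of (5·1 + 3·2 + 1·3 + 1·4) / 10 = 1.8.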