Multi-sentence Video Grounding for Long Video Generation

Wei Feng; Xin Wang; Hong Chen; Zeyang Zhang; Wenwu Zhu

TL;DR

This work tackles the challenge of long video generation by introducing a retrieval-augmented framework based on Multi-sentence Video Grounding. It retrieves grounded source video moments via Massive Video Moment Retrieval and Reliable Mutual Matching Network, edits them with text-guided diffusion (DDIM) while leveraging ControlNet conditioning to preserve temporal coherence, and optionally morphs and personalizes subjects using LoRA and DreamBooth-style fine-tuning. Key contributions include the first study of applying multi-sentence grounding to long video generation, an end-to-end pipeline Grounding(q1,...,qn) → Editing → V' with morphing/personalization, and extensive ablations demonstrating improved subject consistency and reduced flickering on multiple datasets. The approach reduces memory cost by segment-wise editing and offers practical potential for scalable, retrieval-augmented long-video creation, with clear avenues for strengthening grounding models and expanding video corpora in future work.

Abstract

Video generation has witnessed great success recently, but its application to long videos remains challenging because temporal consistency is difficult to maintain and generation incurs a high memory cost. To tackle these problems, we propose a bold new idea, Multi-sentence Video Grounding for Long Video Generation, which connects massive video moment retrieval to the video generation task for the first time and provides a new paradigm for long video generation. Our method can be summarized in three steps: (i) We design sequential scene text prompts as the queries for video grounding, using massive video moment retrieval to search a video database for moment segments that match the text requirements. (ii) Starting from the source frames of the retrieved video moment segments, we apply video editing methods to create new video content while preserving the temporal consistency of the retrieved video. Because the editing can be conducted segment by segment, and even frame by frame, it largely reduces the memory cost. (iii) We also explore video morphing and personalized generation methods to improve the subject consistency of long video generation, and we provide ablation results for these subtasks. Our approach extends developments in image/video editing, video morphing, personalized generation, and video grounding to long video generation, offering effective solutions for generating long videos at low memory cost.
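The three steps above can be illustrated with a small sketch. It is not the authors' implementation: the `Moment` structure, the word-overlap scoring in `ground`, and the `edit_frame` callback are hypothetical stand-ins for a learned grounding model and a diffusion-based editor, and the morphing step is omitted. It only shows the control flow, in particular how editing one segment at a time keeps just that segment's frames in working memory.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence


@dataclass
class Moment:
    """A retrieved video moment (hypothetical structure, for illustration)."""
    video_id: str
    caption: str       # stand-in for whatever a grounding model matches on
    frames: List[str]  # stand-in for decoded frames


def ground(queries: Sequence[str], database: Sequence[Moment]) -> List[Moment]:
    """Toy grounding: for each query, pick the moment whose caption shares
    the most words with it. A real system would use a learned model."""
    def overlap(query: str, moment: Moment) -> int:
        q_words = set(query.lower().split())
        return len(q_words & set(moment.caption.lower().split()))
    return [max(database, key=lambda m: overlap(q, m)) for q in queries]


def edit_segments(
    moments: Sequence[Moment],
    edit_prompts: Sequence[str],
    edit_frame: Callable[[str, str], str],
) -> Iterator[List[str]]:
    """Edit one segment at a time, yielding each edited segment in turn."""
    for moment, prompt in zip(moments, edit_prompts):
        yield [edit_frame(frame, prompt) for frame in moment.frames]


def generate_long_video(
    queries: Sequence[str],
    edit_prompts: Sequence[str],
    database: Sequence[Moment],
    edit_frame: Callable[[str, str], str],
) -> List[str]:
    """Grounding -> per-segment editing -> concatenation (morphing omitted)."""
    moments = ground(queries, database)
    long_video: List[str] = []
    for edited in edit_segments(moments, edit_prompts, edit_frame):
        long_video.extend(edited)
    return long_video


# Tiny usage example with made-up data; the "editor" just tags each frame
# with the new subject word taken from the edit prompt.
database = [
    Moment("v1", "a dog runs on the beach", ["v1_f0", "v1_f1"]),
    Moment("v2", "a dog sleeps on a sofa", ["v2_f0"]),
]
queries = ["a dog runs on the beach", "a dog sleeps on a sofa"]
edit_prompts = ["a corgi runs on the beach", "a corgi sleeps on a sofa"]
video = generate_long_video(
    queries, edit_prompts, database,
    edit_frame=lambda frame, prompt: f"{frame}|{prompt.split()[1]}",
)
```

Because `edit_segments` is a generator, each segment is edited and appended before the next one is touched, which mirrors the memory argument made in step (ii).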

Paper Structure (24 sections, 6 equations, 5 figures, 3 tables)

Introduction
Related work
Video Grounding
Long Video Generation
Video Editing
Method
Multi-sentence Video Moment Grounding
Text-guided Video Editing
Video Morphing and Personalization
Video Morphing
Video Personalization
Experiments
Setups
Datasets.
Models.
...and 9 more sections

Figures (5)

Figure 1: Framework of Multi-sentence Video Grounding for Long Video Generation. In the Multi-sentence Video Moment Grounding stage, we input a sequence of queries $(q_1, q_2, \ldots, q_n)$ and obtain their corresponding video segments $(V_1, V_2, \ldots, V_n) = \mathrm{Grounding}(q_1, q_2, \ldots, q_n)$. In the Text-guided Video Editing stage, each retrieved video segment is edited to form the generated video $V' = \mathrm{Editing}(V, q')$ with a unified subject or scenario. The edited videos $V'_1, V'_2, \ldots, V'_n$ are then smoothly combined into a long video using the Video Morphing method. Personalization fine-tuning is optional; when used, the fine-tuned model replaces the diffusion model to generate videos with customized subjects.
Figure 2: Qualitative example results. (a), (c), and (e) are example videos generated by our method, with the customized subject or scenario in the text queries shown in bold, while (b), (d), and (f) are videos generated by the baseline model.
Figure 3: Example combined video using multi-sentence video grounding for long video generation. Non-bold text represents the queries for video grounding, while bold text represents the portion of the query that is replaced during video editing to generate a customized subject.
Figure 4: Failed video grounding examples. The video grounding model fails to retrieve the correct video segments even though they exist in the video dataset.
Figure 5: Failed video grounding examples. The video grounding dataset lacks video segments of some specific subjects, such as a snake or a submarine.
