Multi-sentence Video Grounding for Long Video Generation
Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Wenwu Zhu
TL;DR
This work tackles the challenge of long video generation by introducing a retrieval-augmented framework based on Multi-sentence Video Grounding. It retrieves grounded source video moments via Massive Video Moment Retrieval and Reliable Mutual Matching Network, edits them with text-guided diffusion (DDIM) while leveraging ControlNet conditioning to preserve temporal coherence, and optionally morphs and personalizes subjects using LoRA and DreamBooth-style fine-tuning. Key contributions include the first study of applying multi-sentence grounding to long video generation, an end-to-end pipeline Grounding(q1,...,qn) → Editing → V' with morphing/personalization, and extensive ablations demonstrating improved subject consistency and reduced flickering on multiple datasets. The approach reduces memory cost by segment-wise editing and offers practical potential for scalable, retrieval-augmented long-video creation, with clear avenues for strengthening grounding models and expanding video corpora in future work.
Abstract
Video generation has seen great success recently, but generating long videos remains challenging because of the difficulty of maintaining temporal consistency and the high memory cost of generation. To tackle these problems, we propose Multi-sentence Video Grounding for Long Video Generation, which connects massive video moment retrieval to the video generation task for the first time and provides a new paradigm for long video generation. Our method can be summarized in three steps: (i) We design sequential scene text prompts as queries for video grounding and use massive video moment retrieval to search a video database for moment segments that match the text. (ii) Starting from the source frames of the retrieved video moment segments, we apply video editing methods to create new video content while preserving the temporal consistency of the retrieved video. Because editing can be conducted segment by segment, and even frame by frame, it largely reduces the memory cost. (iii) We also explore video morphing and personalized generation methods to improve subject consistency in long video generation, and report ablation results for these subtasks. Our approach extends advances in image/video editing, video morphing, personalized generation, and video grounding to long video generation, offering an effective solution for generating long videos at low memory cost.
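The control flow of steps (i) and (ii) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Moment` record, the word-overlap scorer, and the string-tagging `edit_segment` function are hypothetical stand-ins for a learned grounding model and a diffusion-based editor. The sketch shows only that each query retrieves one moment and that editing proceeds one segment at a time, so the editor never holds more than one segment's frames at once.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Moment:
    """Hypothetical record for one video moment in the database."""
    video_id: str
    start: float        # seconds
    end: float          # seconds
    caption: str        # text describing the moment
    frames: List[str]   # stand-in for decoded frames


def overlap_score(query: str, caption: str) -> float:
    """Toy relevance score (Jaccard word overlap); a real system would use a grounding model."""
    q, c = set(query.lower().split()), set(caption.lower().split())
    return len(q & c) / max(len(q | c), 1)


def ground(queries: List[str], corpus: List[Moment]) -> List[Moment]:
    """Step (i): pick the best-matching moment for each sentence query, in order."""
    return [max(corpus, key=lambda m: overlap_score(q, m.caption)) for q in queries]


def edit_segment(moment: Moment, prompt: str) -> List[str]:
    """Step (ii) placeholder: tag each frame instead of running a diffusion editor."""
    return [f"{frame}|edited:{prompt}" for frame in moment.frames]


def generate_long_video(
    queries: List[str],
    edit_prompts: List[str],
    corpus: List[Moment],
    edit_fn: Callable[[Moment, str], List[str]] = edit_segment,
) -> List[str]:
    """Retrieve one moment per query, edit segment by segment, and concatenate."""
    if len(queries) != len(edit_prompts):
        raise ValueError("need exactly one edit prompt per query")
    output: List[str] = []
    for moment, prompt in zip(ground(queries, corpus), edit_prompts):
        # Only the current segment's frames are passed to the editor.
        output.extend(edit_fn(moment, prompt))
    return output


corpus = [
    Moment("v1", 0.0, 4.0, "a dog runs along the beach", ["v1_f0", "v1_f1"]),
    Moment("v2", 10.0, 13.0, "a dog sleeps on the sofa", ["v2_f0"]),
    Moment("v3", 2.0, 5.0, "a car drives through the city", ["v3_f0"]),
]
video = generate_long_video(
    ["dog runs beach", "dog sleeps sofa"],
    ["corgi", "corgi"],
    corpus,
)
```

Reusing the same edit prompt across segments is only a crude proxy for step (iii); the morphing and personalization methods mentioned in the abstract are not modeled here.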
