Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal

TL;DR

Video-MSG addresses the challenge of aligning text prompts with complex spatio-temporal requirements in text-to-video diffusion. By separating background planning, foreground layout, and structured noise initialization, it achieves training-free, memory-efficient guidance that leverages multimodal planning and inversion rather than backbone fine-tuning. The method yields substantial gains in motion binding, spatial relationships, and numeracy across multiple backbones and benchmarks, while enabling easier integration with large T2V models. This approach demonstrates the practical potential of multimodal planning and noise-inversion strategies to improve controllability in video generation with minimal additional memory during inference.

Abstract

Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research applies layout guidance to T2V models, but these methods require fine-tuning or iterative manipulation of the attention map at inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps. In the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video that specifies background, foreground, and object trajectories in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG needs neither fine-tuning nor attention manipulation with additional memory at inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies on the noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
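
The last step, guiding the backbone through noise inversion and denoising, can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: inversion is approximated by the standard DDPM forward-noising formula, the beta schedule is linear, and the Video Sketch is stood in for by a short list of numbers. The names `linear_alpha_bar`, `noise_sketch`, and `inversion_ratio` are hypothetical; the paper's actual inversion procedure and schedule may differ.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_sketch(sketch, inversion_ratio, alpha_bar, rng):
    """Forward-noise sketch values up to the step chosen by inversion_ratio.

    Returns (noisy_values, start_step). A real pipeline would then denoise
    from start_step down to 0 with the T2V backbone, which is not shown here.
    """
    if not 0.0 < inversion_ratio <= 1.0:
        raise ValueError("inversion_ratio must be in (0, 1]")
    start_step = max(1, round(inversion_ratio * len(alpha_bar)))
    a = alpha_bar[start_step - 1]
    signal, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    noisy = [signal * x + sigma * rng.gauss(0.0, 1.0) for x in sketch]
    return noisy, start_step


alpha_bar = linear_alpha_bar(1000)
sketch = [0.5, -0.25, 1.0, 0.0]  # stand-in for flattened sketch latents
rng = random.Random(0)
noisy_low, step_low = noise_sketch(sketch, 0.3, alpha_bar, rng)
noisy_high, step_high = noise_sketch(sketch, 0.9, alpha_bar, rng)
```

Under these assumptions, a smaller ratio keeps more of the sketch's layout in the starting latent, while a larger ratio leaves the backbone more freedom to repaint the draft frames. This is the trade-off that the noise inversion ratio ablation examines.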

Paper Structure

This paper contains 26 sections, 2 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison of different text-to-video generation methods: (a) single model for video generation, (b) video generation with (attention-based) layout guidance, and our (c) Video-MSG, a training-free guidance method for T2V generation based on multimodal planning and structured noise initialization. Since Video-MSG does not need fine-tuning or additional memory during inference time, it is easier to adopt large T2V models than previous video layout guidance methods based on fine-tuning or iterative attention manipulation.
  • Figure 2: Three stages of Video-MSG. In the first stage, the MLLM plans specific global and local contexts that fit the provided text-to-video prompt. The text-to-image (T2I) model uses the MLLM-planned context to render the necessary components of the video. In the third stage, we generate the video from the Video Sketch via noise inversion.
  • Figure 3: Videos generated with CogVideoX-5B and Video-MSG with CogVideoX-5B backbone. The videos generated with Video-MSG are more accurate regarding object motions, numeracy, and spatial relationships.
  • Figure 4: Example video showing the importance of background object detection in foreground object placement.
  • Figure 5: Example video showing the importance of foreground object segmentation.
  • ...and 3 more figures