FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

Yu Lu, Linchao Zhu, Hehe Fan, Yi Yang

TL;DR

FlowZero tackles zero-shot text-to-video (T2V) generation by coupling an LLM-generated Dynamic Scene Syntax (DSS) with an image diffusion model to produce temporally coherent videos from text prompts. The DSS comprises per-frame scene descriptions, foreground layouts, and background motion patterns, and an iterative self-refinement loop aligns the layouts with the prompt. Motion-guided noise shifting encodes background and camera motion into each frame's initial noise, and a U-Net with cross-attention, gated attention, and cross-frame attention produces frame-to-frame coherent output. In qualitative and quantitative evaluations, FlowZero outperforms several zero-shot baselines, which indicates that structured spatio-temporal planning matters for realistic T2V.

Abstract

Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally-coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text, where LLMs can generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These elements in DSS are then used to guide the image diffusion model for video generation with smooth object motions and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process, enhancing the alignment between the spatio-temporal layouts and the textual prompts for the videos. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics to control the background movement and camera motion adaptively. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves improvement in zero-shot video synthesis, generating coherent videos with vivid motion.
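
The iterative self-refinement process mentioned in the abstract can be pictured as a simple loop that stops once the LLM's confidence in the layouts is high enough. The following is a minimal sketch, not the authors' implementation: `critique` is a hypothetical callable standing in for the LLM verify-and-rectify step, and `threshold` and `max_rounds` are illustrative parameters (the round cap in particular is an assumption added so the loop always terminates).

```python
from typing import Any, Callable, Tuple


def refine_layouts(
    prompt: str,
    initial_layouts: Any,
    critique: Callable[[str, Any], Tuple[Any, float]],
    threshold: float,
    max_rounds: int = 5,
) -> Tuple[Any, float]:
    """Repeatedly ask `critique` to check and fix the layouts.

    `critique` is a hypothetical stand-in for the LLM call: it takes the
    prompt and the current layouts and returns (revised_layouts, confidence).
    The loop ends when confidence exceeds `threshold`, or after `max_rounds`
    calls (the cap is an assumption, not something the source specifies).
    """
    layouts = initial_layouts
    confidence = 0.0
    for _ in range(max_rounds):
        layouts, confidence = critique(prompt, layouts)
        if confidence > threshold:
            break
    return layouts, confidence
```

Passing the LLM step in as a callable keeps the sketch independent of any particular LLM API, which the text above does not specify.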

Paper Structure (18 sections, 1 equation, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Zero-shot text-to-video generation. We present a new framework for text-to-video generation with exceptional temporal coherence, featuring realistic object movements, transformations, and background motion within the generated videos.
  • Figure 2: Overview of FlowZero. Starting from a video prompt, we first instruct the LLM (i.e., GPT-4) to generate serial frame-by-frame syntax, including scene descriptions, foreground layouts, and background motion patterns. We employ an iterative self-refinement process to improve the generated spatio-temporal layouts. This process implements a feedback loop in which the LLM autonomously verifies and rectifies the spatial and temporal errors of the initial layouts. The loop continues until the confidence score $C$ for the modified layouts exceeds a predefined threshold $\lambda$. Next, we perform motion-guided noise shifting (MNS) to obtain the initial noise for each frame $i$ by shifting the first frame's noise with the predicted background motion direction $d_{i}$ and speed $s_{i}$. Then, a U-Net with cross-attention, gated attention, and cross-frame attention is used to obtain $N$ coherent video frames.
  • Figure 3: Qualitative comparison. Our method can capture detailed object motion to generate temporally coherent frame sequences.
  • Figure 4: Qualitative comparison. Our method can model intricate object transformations representing narrative structures in the video prompt.
  • Figure 5: Ablation studies of the effectiveness of FlowZero. (A) cross-frame attention, (B) scene descriptions, (C) foreground layouts, (D) motion-guided noise shifting.
  • ...and 1 more figures
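
The motion-guided noise shifting (MNS) step described in the Figure 2 caption can be illustrated with a small, dependency-free sketch. Several details here are assumptions rather than the paper's specification: the direction vocabulary in `DIRECTIONS`, the use of integer speeds measured in latent pixels, the accumulation of per-frame displacements into a running offset, and the circular wrap-around at the borders. All function names are hypothetical.

```python
import random

# Hypothetical direction vocabulary mapped to (dy, dx) unit steps.
DIRECTIONS = {
    "static": (0, 0),
    "left": (0, -1),
    "right": (0, 1),
    "up": (-1, 0),
    "down": (1, 0),
}


def make_noise(height, width, seed=0):
    """Sample a 2D grid of standard Gaussian noise (one latent channel)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def shift_noise(noise, dy, dx):
    """Move grid content by (dy, dx) cells; wrap-around is an assumption."""
    h, w = len(noise), len(noise[0])
    return [[noise[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]


def motion_guided_noise(first_noise, directions, speeds):
    """Build one initial-noise grid per frame from the first frame's noise.

    directions[i] and speeds[i] play the role of d_i and s_i. Each frame's
    displacement is added to a running offset (an assumption), so a frame
    with "static" motion or zero speed keeps the previous offset.
    """
    frames = []
    off_y = off_x = 0
    for d, s in zip(directions, speeds):
        step_y, step_x = DIRECTIONS[d]
        off_y += step_y * s
        off_x += step_x * s
        frames.append(shift_noise(first_noise, off_y, off_x))
    return frames
```

Because every frame's noise is a shifted copy of the same sample, the frames share low-level structure that moves with the intended background motion, which is the intuition behind using the initial noise to steer global coherence.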