Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi

TL;DR

This work argues that many failure modes in state-of-the-art Text-to-Video (T2V) diffusion models arise from poor initial-frame grounding. It introduces Factorized Video Generation (FVG), which decouples scene construction (Reasoning and Composition) from temporal synthesis using an LLM-written first-frame description, a T2I-generated anchor frame, and a video diffusion model conditioned on that anchor. Empirically, FVG sets a new state of the art on T2V-CompBench and improves results on VBench2, while cutting the number of sampling steps by 70% without performance loss, which yields a practical speed-up. The approach presents grounding as a design choice complementary to scaling, exposes evaluation gaps, and provides resources to advance robust, controllable video synthesis.

Abstract

State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state of the art on the T2V-CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis.
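The three stages described in the abstract can be read as a simple composition of functions. The sketch below is only an illustration of that control flow, not the authors' implementation: the class `FactorizedPipeline`, its field names, and the three stub functions are hypothetical stand-ins for a real LLM, T2I model, and anchor-conditioned video model. The default of 15 sampling steps is an assumption taken from the Figure 4 caption (15 steps against a 50-step baseline, which matches the stated 70% reduction).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FactorizedPipeline:
    """Hypothetical wrapper chaining the three FVG stages."""

    rewrite_prompt: Callable[[str], str]      # (1) Reasoning: LLM rewrites the prompt
    generate_image: Callable[[str], Any]      # (2) Composition: T2I makes the anchor
    animate: Callable[[Any, str, int], Any]   # (3) Temporal Synthesis: video model

    def __call__(self, video_prompt: str, num_steps: int = 15) -> Any:
        # Stage 1: describe only the initial scene.
        first_frame_prompt = self.rewrite_prompt(video_prompt)
        # Stage 2: synthesize the anchor frame from the rewritten prompt.
        anchor = self.generate_image(first_frame_prompt)
        # Stage 3: animate the anchor while following the original prompt.
        return self.animate(anchor, video_prompt, num_steps)


# Stub stages so the sketch runs without any model weights (all hypothetical).
def stub_llm(prompt: str) -> str:
    return f"first frame of: {prompt}"


def stub_t2i(prompt: str) -> dict:
    return {"image_for": prompt}


def stub_video(anchor: Any, prompt: str, num_steps: int) -> dict:
    return {"anchor": anchor, "prompt": prompt, "steps": num_steps}


pipeline = FactorizedPipeline(stub_llm, stub_t2i, stub_video)
result = pipeline("a cat jumps onto a table")
```

Note that the video stage receives the original video prompt rather than the rewritten one, since the abstract states that the video model animates the scene while "following the prompt"; the rewritten prompt is used only to build the anchor.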

Paper Structure

This paper contains 37 sections, 4 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Example failure modes in SoTA video generative foundational models. We compare the first frame of videos from T2V Wan 2.2 wan2025wan and our Factorized Wan 2.2. T2V Wan 2.2 composes scenes incorrectly and exhibits logical temporal inconsistencies, struggling to establish coherent scene structure without explicit visual grounding. This can be solved via our factorization.
  • Figure 2: Overview of the finetuning and inference pipelines for our factorized video generation. Anchor-Grounding Finetuning: The Video Diffusion Model (VDM) is trained to follow a visual anchor by injecting a clean image latent at a randomly chosen frame position and setting its diffusion timestep to $t=0$. Lightweight LoRA finetuning trains the model to treat this clean frame as a fixed scene constraint that guides the rest of the video. Factorized Inference Pipeline: an LLM rewrites the video prompt into a first-frame descriptive prompt, which is used to generate an anchor image. The anchor is then injected into the VDM with its timestep fixed to $0$.
  • Figure 3: Qualitative results showing our factorized method leads to better performance. Additional results are provided in the appendix.
  • Figure 4: Percentage change in Wan 2.2 (5B) performance relative to the 50-step baseline. Both T2V variants degrade as steps decrease, while the factorized model remains stable even at 15 steps.
  • Figure 5: Overall qualitative results comparing our factorized method against T2V methods. Visual examples from Wan14B showing that factorized T2V produces more coherent scene layouts and semantically aligned compositions than text-only T2V methods.
  • ...and 5 more figures
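
The anchor injection described in the Figure 2 caption (a clean image latent placed at one frame position, with that frame's diffusion timestep set to $0$) can be sketched with per-frame timesteps. This is a minimal illustration under stated assumptions, not the paper's code: the function `inject_anchor` and its parameters are hypothetical, latents are represented by plain Python objects instead of tensors, and placing the anchor at position 0 during inference is an assumption based on the anchor being a first-frame image.

```python
import random


def inject_anchor(noisy_latents, timesteps, anchor_latent, position=None, rng=random):
    """Return copies of the per-frame latents and timesteps with a clean anchor.

    noisy_latents: list of per-frame latents (any objects in this sketch).
    timesteps:     list of per-frame diffusion timesteps, same length.
    position:      frame index for the anchor; None picks one at random,
                   mirroring the random position used during finetuning.
    """
    if len(noisy_latents) != len(timesteps):
        raise ValueError("latents and timesteps must have the same length")
    if position is None:
        position = rng.randrange(len(noisy_latents))
    latents = list(noisy_latents)   # copy so the inputs are not mutated
    steps = list(timesteps)
    latents[position] = anchor_latent  # the clean (noise-free) frame
    steps[position] = 0                # t = 0 marks this frame as already denoised
    return latents, steps, position


# Inference-style call: anchor assumed to sit at the first frame.
frames = ["noise0", "noise1", "noise2"]
lat, ts, pos = inject_anchor(frames, [900, 900, 900], "clean_anchor", position=0)

# Finetuning-style call: anchor position drawn at random.
lat_train, ts_train, pos_train = inject_anchor(
    frames, [900, 900, 900], "clean_anchor", rng=random.Random(0)
)
```

Giving the anchor frame its own timestep of 0 while the other frames keep their noisy timesteps is what lets the model distinguish the fixed scene constraint from the frames it still has to denoise.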