Controllable Video Generation with Provable Disentanglement

Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Namrata Deka, Zongfang Liu, Zeyu Tang, Guangyi Chen, Kun Zhang

TL;DR

This work tackles the challenge of fine-grained controllable video generation by formulating a disentangled latent model with static content ${\mathbf{z}}^c$ and time-varying style dynamics ${\mathbf{z}}_t^s$, governed by a stationary nonlinear causal process. The authors introduce the Temporal Transition Module (TTM) within a StyleGAN2-ADA–based GAN (CoVoGAN) to enforce minimal and sufficient change, yielding block-wise and component-wise identifiability guarantees. They prove identifiability theorems under mild assumptions and validate the approach with extensive experiments across FaceForensics, SkyTimelapse, RealEstate10K, and CelebV-HQ, showing superior video quality (FVD) and disentanglement (MCC, SAP, Modularity) and demonstrating robust, interpretable control over motion components. The approach achieves efficient inference and provides a principled framework for disentangled, controllable video synthesis with potential applications in animation, simulation, and media generation.
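
For readers unfamiliar with the MCC metric named above, the following is a minimal sketch of how a mean correlation coefficient is commonly computed: correlate every true latent with every estimated latent, pick the best one-to-one matching, and average the matched absolute correlations. The function name, the use of Pearson correlation, and the brute-force matching are assumptions made for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
import itertools
import numpy as np

def mean_correlation_coefficient(z_true, z_est):
    """MCC between true and estimated latents, each of shape (n_samples, d)."""
    d = z_true.shape[1]
    # Cross-correlation block: entry (i, j) is corr(z_true[:, i], z_est[:, j]).
    corr = np.corrcoef(z_true, z_est, rowvar=False)[:d, d:]
    abs_corr = np.abs(corr)
    # Brute-force search over one-to-one matchings (assumption: d is small).
    best = max(
        sum(abs_corr[i, perm[i]] for i in range(d))
        for perm in itertools.permutations(range(d))
    )
    return best / d

rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 3))
# Estimated latents that are a permuted, rescaled, sign-flipped copy of z.
z_hat = z[:, [2, 0, 1]] * np.array([-2.0, 0.5, 3.0])
mcc_perfect = mean_correlation_coefficient(z, z_hat)
mcc_noise = mean_correlation_coefficient(z, rng.standard_normal((5000, 3)))
```

A score near 1 indicates that each true factor is recovered up to permutation and an element-wise rescaling, which is the kind of recovery component-wise identifiability is about. For larger latent dimensions, the brute-force search would normally be replaced by an assignment solver such as `scipy.optimize.linear_sum_assignment`.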

Abstract

Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling disentangled control of video generation. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
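
The latent structure described in the abstract can be illustrated with a toy sketch: a static content vector is sampled once per video, while a low-dimensional dynamic vector evolves over time, with each component driven by the previous state and its own independent noise (so the components are conditionally independent given the past). Everything concrete here, including the dimensions, the `tanh` transition, the matrix `W`, and the linear mixing `G`, is a hypothetical stand-in and not the paper's Temporal Transition Module or synthesis network.

```python
import numpy as np

def sample_latents(T, d_c=8, d_s=2, seed=0):
    """Sample static content z_c (once) and dynamic style z_s (per frame)."""
    rng = np.random.default_rng(seed)
    z_c = rng.standard_normal(d_c)             # static: shared by all frames
    W = 0.5 * rng.standard_normal((d_s, d_s))  # hypothetical transition weights
    z_s = np.zeros((T, d_s))
    z_prev = rng.standard_normal(d_s)
    for t in range(T):
        eps = rng.standard_normal(d_s)         # one independent noise per component
        # Component i depends on z_{t-1} and only on its own noise eps[i],
        # so components are independent given the previous state.
        z_s[t] = np.tanh(W @ z_prev) + 0.1 * eps
        z_prev = z_s[t]
    return z_c, z_s

def toy_frames(z_c, z_s, frame_dim=16, seed=1):
    """Toy mixing x_t = g(z_c, z_s[t]); a real model would use a synthesis network."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((frame_dim, z_c.size + z_s.shape[1]))
    # Every frame reuses the same z_c; only z_s[t] changes over time.
    z = np.concatenate([np.tile(z_c, (len(z_s), 1)), z_s], axis=1)
    return np.tanh(z @ G.T)

z_c, z_s = sample_latents(T=16)
frames = toy_frames(z_c, z_s)
```

Keeping `d_s` small mirrors the idea of minimizing the dimensionality of the dynamic latents, so that whatever does not need to change over time is absorbed by the static block.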

Paper Structure

This paper contains 36 sections, 4 theorems, 24 equations, 9 figures, 12 tables.

Key Result

Theorem 3.2

Consider video observation $V = \{{\mathbf{x}}_1, {\mathbf{x}}_2, \dots, {\mathbf{x}}_T\}$ generated by process $(g, \mathbf{f}^s, \mathbf{f}^c, \mathbf{p}^s, \mathbf{p}^c)$ with latent variables denoted as ${\mathbf{z}}_t^s$ and ${\mathbf{z}}^c$, according to Equation eq:generation, where [...] are satisfied, then ${\mathbf{z}}_t$ is block-wise identifiable with regard to $\hat{{\mathbf{z}}}$ [...]
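
As a reading aid, block-wise and component-wise identifiability are usually formalized along the following lines in the identifiability literature. The maps $h^c$, $h^s$, $h_i$ and the permutation $\pi$ are notation assumed here for illustration; the paper's own Definitions 2.2 and 2.3 may differ in detail.

$$
\hat{\mathbf{z}}^c = h^c(\mathbf{z}^c), \qquad \hat{\mathbf{z}}_t^s = h^s(\mathbf{z}_t^s) \qquad \text{(block-wise: } h^c, h^s \text{ invertible)}
$$

$$
\hat{z}_{t,i}^s = h_i\big(z_{t,\pi(i)}^s\big) \qquad \text{(component-wise: each } h_i \text{ invertible, } \pi \text{ a permutation)}
$$

Under this reading, the block-wise statement says that the estimated content and style blocks do not mix with each other, and the component-wise statement further says that each estimated style dimension corresponds to exactly one true style dimension.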

Figures (9)

  • Figure 1: Videos are generated using Kling and Wan with the prompt: "while this person was speaking, the head gradually shifted from the middle to the right." The first row shows that essential motion cues are partially omitted, while in the second row the head size changes undesirably.
  • Figure 2: The generating process. The gray shade of nodes indicates that the variable is observable.
  • Figure 3: Generator operates from left to right, beginning with a random noise input. The noise first passes through a Temporal Transition Module, which produces a disentangled representation of the underlying factors. This representation is then fed into the synthesis network to generate frames at the pixel level. In the figure, the blue arrow illustrates the Deep Sigmoid Flow.
  • Figure 4: Controllability in the latent space across datasets and methods. Each method is evaluated with three samples by varying a single latent dimension. Only CoVoGAN exhibits consistent control across identities: (a) head pose adjustment on FaceForensics, (b) camera translation on RealEstate.
  • Figure 5: Controllability visualization results on the FaceForensics dataset. Two distinct motion concepts are manipulated to illustrate component-wise disentanglement. Corresponding videos are provided in the supplementary materials for better visualization.
  • ...and 4 more figures
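
The Deep Sigmoid Flow mentioned in the Figure 3 caption is, in its usual form, a monotone scalar transform: a convex combination of sigmoids with positive slopes, followed by a logit. The sketch below shows that form only; the parameter names and the way parameters are supplied are assumptions, and how CoVoGAN conditions or stacks these flows is not described in this summary.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def _softplus(x):
    return np.logaddexp(0.0, x)

def deep_sigmoid_flow(x, a_raw, b, w_raw, eps=1e-6):
    """Apply y = logit(sum_k w_k * sigmoid(a_k * x + b_k)) to x of shape (n,).

    a_raw, b, w_raw have shape (k,); they are unconstrained parameters.
    """
    a = _softplus(a_raw)                 # positive slopes keep the map increasing
    w = np.exp(w_raw - np.max(w_raw))
    w = w / w.sum()                      # softmax: weights are positive, sum to 1
    h = _sigmoid(np.outer(x, a) + b)     # shape (n, k)
    s = np.clip(h @ w, eps, 1.0 - eps)   # convex combination, kept inside (0, 1)
    return np.log(s) - np.log1p(-s)      # logit maps back to the real line

x = np.linspace(-3.0, 3.0, 50)
y = deep_sigmoid_flow(
    x,
    a_raw=np.array([0.0, 1.0, -1.0]),
    b=np.array([0.5, -0.5, 0.0]),
    w_raw=np.array([0.1, 0.2, 0.3]),
)
```

Because the transform is strictly increasing in `x` (away from the clipping range), it is invertible, which is the property that makes this family a natural building block for mapping independent noise to latent components.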

Theorems & Definitions (16)

  • Definition 2.1: Observational Equivalence
  • Definition 2.2: Block-wise Identification of Generating Process
  • Definition 2.3: Component-wise Identification of Style Dynamics
  • Definition 3.1
  • Theorem 3.2: Block-wise Identifiability
  • Theorem 3.3: Component-wise Identifiability
  • Theorem A1: Blockwise Identifiability
  • proof
  • Theorem A2: Component-wise Identifiability
  • proof
  • ...and 6 more