DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization

Marzieh Gheisari, Auguste Genovesio

TL;DR

DiViD addresses the challenge of unsupervised static–dynamic disentanglement in video by introducing an end-to-end diffusion framework. It factorizes each video into a global static token $s$ and per-frame dynamic tokens $d_i$, and reconstructs frames via a conditional DDPM decoder that leverages shared noise across frames. Key contributions include architectural leakage mitigation through residual encoding, a time-varying KL-based information bottleneck, cross-attention interactions that route static and dynamic information, and an orthogonality regularizer to prevent leakage; experiments on MHAD and MEAD show superior joint swap accuracy and reduced cross-leakage. The approach advances high-fidelity video generation and controllable manipulation of appearance and motion, with practical implications for synthesis, transfer, and analysis of real-world video data.

Abstract

Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD's sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.
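Two of the inductive biases named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that "shared noise" means one Gaussian sample reused for every frame of a video in the standard DDPM forward step, and that the orthogonality regularizer penalizes the squared cosine similarity between the static token and each dynamic token. The function names, tensor shapes, and the exact penalty form are hypothetical and chosen only for illustration.

```python
import numpy as np


def shared_noise_forward(frames, alpha_bar_t, rng):
    """Noise all frames of one video with the SAME Gaussian sample (assumed form).

    frames: array of shape (T, C, H, W).
    alpha_bar_t: cumulative DDPM signal level at the chosen timestep, in (0, 1).
    """
    # Draw a single noise map of shape (C, H, W) and reuse it for all T frames.
    eps = rng.standard_normal(frames.shape[1:])
    eps = np.broadcast_to(eps, frames.shape)
    # Standard DDPM forward step: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps.
    noisy = np.sqrt(alpha_bar_t) * frames + np.sqrt(1.0 - alpha_bar_t) * eps
    return noisy, eps


def orthogonality_penalty(s, d):
    """Mean squared cosine similarity between static and dynamic tokens (assumed form).

    s: static token of shape (D,).
    d: dynamic tokens of shape (T, D), one per frame.
    Returns 0 when every dynamic token is orthogonal to the static token.
    """
    s_unit = s / (np.linalg.norm(s) + 1e-8)
    d_unit = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-8)
    cos = d_unit @ s_unit  # shape (T,)
    return float(np.mean(cos ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.standard_normal((4, 3, 8, 8))   # toy video: 4 frames
    noisy, eps = shared_noise_forward(video, alpha_bar_t=0.5, rng=rng)
    print(noisy.shape, np.allclose(eps[0], eps[3]))

    s = rng.standard_normal(16)                 # toy static token
    d = rng.standard_normal((4, 16))            # toy per-frame dynamic tokens
    print(orthogonality_penalty(s, d))
```

Because the noise is identical across frames in this sketch, any difference between noised frames comes only from the frames themselves, which is one plausible reading of why shared noise helps temporal consistency. In training, the penalty would be added to the diffusion loss with a weight, and that weight is not given in this section.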

Paper Structure

This paper contains 21 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of DiViD. The sequence encoder decomposes an input video into a shared static token and frame-specific dynamic tokens, which condition the U-Net denoiser via cross-attention in the diffusion decoder.
  • Figure 2: Static–dynamic factor swapping on an MHAD example. From top to bottom: static source (first frame repeated for our method and DBSE; full sequence for SPYL), dynamic source, and swapped outputs by our method, DBSE, and SPYL. Our method cleanly preserves identity while transferring the action; DBSE retains identity but fails to transfer motion; SPYL transfers motion at the cost of appearance fidelity.
  • Figure 3: Static–dynamic factor swapping on a MEAD example. From top to bottom: static source (first frame for our method and DBSE; full sequence for SPYL), dynamic source, and swapped outputs by our method, DBSE, and SPYL. Our method accurately transfers facial expressions while preserving identity; DBSE under-transfers expression dynamics; SPYL mixes appearance and motion, losing fidelity on both.
  • Figure 4: Failure mode of SPYL on an MHAD example. The top row shows the static source (first frame), the middle row the dynamic source, and the bottom row SPYL’s “swapped” output. SPYL merely copies the target motion and fails to preserve the source identity, illustrating that high dynamic-only accuracy can mask a lack of true disentanglement.