Ctrl-VI: Controllable Video Synthesis via Variational Inference

Haoyi Duan, Yunzhi Zhang, Yilun Du, Jiajun Wu

TL;DR

Ctrl-VI tackles flexible user control in video synthesis by framing generation as sampling from a target distribution formed by a product of backbone distributions, i.e., $p^*(x|y) \propto \prod_i p^{(i)}(x|y^{(i)})$, and minimizing $\mathrm{KL}(q \,\|\, p^*)$ via an annealed sequence of targets. It combines Stein variational gradient descent (SVGD) with a context-conditioned factorization that reduces the number of modes in the solution space and improves 3D consistency, enabling mixed inputs from text, images, camera trajectories, and 3D asset trajectories. The framework instantiates backbones for image-to-video, depth/flow, and trajectory conditioning, with adaptive masks and context priors. Experiments show that Ctrl-VI yields improved controllability, diversity, and scene coherence compared to fixed-form baselines and product-of-experts (PoE) approaches, and that it extends to longer sequences with robust background fidelity.
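
As a hedged aside, the product form above has a convenient consequence for gradient-based sampling. The normalizing constant of $p^*$ does not depend on $x$, so the score of the composed target is the sum of the backbone scores:

$$\nabla_x \log p^*(x|y) = \sum_i \nabla_x \log p^{(i)}(x|y^{(i)}).$$

For reference, the standard SVGD update for particles $\{x_j\}_{j=1}^{n}$, kernel $k$, and step size $\epsilon$ is written below. The symbols $n$, $k$, and $\epsilon$ are notation introduced here for illustration, and the exact update used by Ctrl-VI may differ:

$$x_i \leftarrow x_i + \frac{\epsilon}{n} \sum_{j=1}^{n} \left[ k(x_j, x_i)\, \nabla_{x_j} \log p^*(x_j|y) + \nabla_{x_j} k(x_j, x_i) \right].$$

The first term pulls particles toward high-density regions of $p^*$, and the second term repels particles from each other, which is what preserves diversity across the particle set.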

Abstract

Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop Ctrl-VI, a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.
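
The optimization idea in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes two 2D Gaussian "experts" as stand-ins for video backbones, uses textbook SVGD with an RBF kernel, and uses a simple tempering schedule (`betas`) as a hypothetical annealed sequence of targets. All function names and parameter values are chosen here for illustration.

```python
import numpy as np

def expert_score(x, mean, var):
    """Score (gradient of the log-density) of an isotropic Gaussian expert."""
    return -(x - mean) / var

def product_score(x, experts, beta):
    """Score of the tempered product of experts: beta * sum of expert scores."""
    return beta * sum(expert_score(x, m, v) for m, v in experts)

def svgd_step(x, score, step):
    """One SVGD update with an RBF kernel and a median-heuristic bandwidth."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]       # diff[i, j] = x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)            # squared pairwise distances
    h = np.median(sq) / np.log(n + 1) + 1e-8   # kernel bandwidth
    k = np.exp(-sq / h)                        # k[i, j] = k(x_i, x_j), symmetric
    attract = k @ score                        # sum_j k(x_j, x_i) * score_j
    # sum_j grad_{x_j} k(x_j, x_i) = (2 / h) * sum_j k[i, j] * (x_i - x_j)
    repulse = (2.0 / h) * np.sum(k[:, :, None] * diff, axis=1)
    return x + step * (attract + repulse) / n

def run_annealed_svgd(n_particles=50, n_levels=10, steps_per_level=200,
                      step=0.2, seed=0):
    """Run SVGD against a sequence of tempered targets ending at the product."""
    rng = np.random.default_rng(seed)
    # Toy experts (assumed): unit-variance Gaussians centered at (-1, 0) and (1, 0).
    experts = [(np.array([-1.0, 0.0]), 1.0), (np.array([1.0, 0.0]), 1.0)]
    x = rng.normal(size=(n_particles, 2)) + 2.0  # start away from the target
    betas = np.linspace(0.1, 1.0, n_levels)      # hypothetical annealing schedule
    for beta in betas:
        for _ in range(steps_per_level):
            x = svgd_step(x, product_score(x, experts, beta), step)
    return x

particles = run_annealed_svgd()
print("particle mean:", particles.mean(axis=0))
print("particle std: ", particles.std(axis=0))
```

For these two experts the product is a Gaussian centered at the origin with variance 0.5 per axis, so the particle mean should settle near the origin while the repulsive term keeps the particles spread out rather than collapsed onto a single point.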

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Task Overview. This work develops a controllable video synthesis framework, Ctrl-VI, where different forms of user inputs (top left) are supported and can be flexibly mixed for each generation pass. The top-right shows an output sample. Users can further change any prompt element to achieve different levels of control over outputs (bottom).
  • Figure 2: Overview of our method Ctrl-VI. Top-left: Input specification, including text prompt $\mathcal{Y}$, camera trajectory $\mathcal{C}$, input image pair $\{\mathcal{I}_{\text{bg}}, \mathcal{I}_{\text{fg}}\}$, and 3D asset trajectory. Bottom-left: We then compute context conditionals $z_t^{\text{context}}$ that provide background priors for maintaining scene consistency. Bottom-center: Foreground masks $\mathcal{M}_{\text{fg}}$ and simulation masks $\mathcal{M}_{\text{sim}}$ define the regions handled by their respective models, while $\mathcal{M}_{\text{context}}$ specifies the region for context conditionals. Only one particle is shown for simplicity. Right: Example frames from the generated video.
  • Figure 3: Qualitative Results. For each test case, input object simulation and camera trajectory are visualized, with text prompts shown at the bottom. Our method generates videos aligned with object and camera trajectories while exhibiting natural and coherent content for unconstrained regions.
  • Figure 4: Output Diversity. A set of particles is obtained during the proposed optimization procedure for population-level sampling. The figure visualizes two particles from the same optimization. Both follow the input conditions (fire trajectory and text prompts) while exhibiting diversity in under-specified regions, e.g., the human.
  • Figure 5: Baseline Comparisons. Failure modes of baselines include fixed cameras under out-of-distribution trajectories (e.g., "orbit right" on the left) and unnatural object drifting (right), while ours achieves better controllability by incorporating constraints from multiple model components.
  • ...and 2 more figures