CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion
Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Jialun Liu, Hao Pan, Yuchi Huo, Rui Wang, Haibin Huang, Chi Zhang, Xuelong Li
TL;DR
CtrlVDiff presents a unified diffusion framework for controllable video generation and understanding that jointly models geometry, appearance, semantics, and structure through eight modalities. It introduces the Hybrid Modality Control Strategy (HMCS) to flexibly route conditioning and target modalities, paired with MMVideo, a large real-and-synthetic multimodal dataset, enabling robust cross-modal learning. The method achieves state-of-the-art or competitive results across depth, segmentation, normals, and material estimation, while enabling practical edits such as relighting, material swapping, and object insertion with strong temporal coherence. By reducing reliance on external estimators and unifying generation and understanding in a single model, CtrlVDiff offers versatile capabilities for high-fidelity, interpretable, and controllable video synthesis across real and synthetic domains.
Abstract
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insight is two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, which limits physically meaningful edits such as relighting or material swaps and often causes temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. To address these difficulties, we propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this training, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
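As a rough illustration of what "accept any subset of modalities" can mean at training time, the sketch below randomly assigns each modality a role: given as a condition, or to be generated as a target. This is not the HMCS implementation, whose details the abstract does not specify. The function names, the `p_condition` parameter, and the inclusion of `rgb` as the eighth modality are all assumptions made for illustration.

```python
import random

# Modality names taken from the abstract; "rgb" as the eighth is an assumption.
MODALITIES = ["rgb", "depth", "normals", "segmentation", "edges",
              "albedo", "roughness", "metallic"]

def sample_roles(modalities, rng, p_condition=0.5):
    """Hypothetical helper: mark each modality as 'condition' or 'target'."""
    roles = {
        m: ("condition" if rng.random() < p_condition else "target")
        for m in modalities
    }
    # Guarantee at least one target so there is always something to generate.
    if all(role == "condition" for role in roles.values()):
        roles[rng.choice(modalities)] = "target"
    return roles

def split_by_role(roles):
    """Return (conditions, targets) as two lists of modality names."""
    conditions = [m for m, r in roles.items() if r == "condition"]
    targets = [m for m, r in roles.items() if r == "target"]
    return conditions, targets

rng = random.Random(0)  # fixed seed so the example is reproducible
roles = sample_roles(MODALITIES, rng)
conditions, targets = split_by_role(roles)
print("conditions:", conditions)
print("targets:   ", targets)
```

Under this assumed scheme, sampling a different partition at every training step exposes the model to many condition/target combinations, which is one plausible way to obtain robustness when some modalities are missing at inference.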
