CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Jialun Liu, Hao Pan, Yuchi Huo, Rui Wang, Haibin Huang, Chi Zhang, Xuelong Li

TL;DR

CtrlVDiff presents a unified diffusion framework for controllable video generation and understanding that jointly models geometry, appearance, semantics, and structure through eight modalities. It introduces the Hybrid Modality Control Strategy (HMCS) to flexibly route conditioning and target modalities, paired with MMVideo, a large real-and-synthetic multimodal dataset, enabling robust cross-modal learning. The method achieves state-of-the-art or competitive results across depth, segmentation, normals, and material estimation, while enabling practical edits such as relighting, material swapping, and object insertion with strong temporal coherence. By reducing reliance on external estimators and unifying generation and understanding in a single model, CtrlVDiff offers versatile capabilities for high-fidelity, interpretable, and controllable video synthesis across real and synthetic domains.

Abstract

We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We therefore propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

Paper Structure

This paper contains 38 sections, 1 equation, 18 figures, 9 tables, 1 algorithm.

Figures (18)

  • Figure 1: We present CtrlVDiff, a controllable video generation framework via unified multimodal video diffusion. (a) Video Understanding: By leveraging the input video and a text prompt, our model accurately estimates diverse modality representations simultaneously, including depth, normal, segmentation, and canny maps. For material-related modalities, it produces clean and physically plausible albedo, roughness, and metallic outputs. (b) Controllable Video Generation: Using the decomposed multimodal signals and the original prompt as conditions, our model reconstructs temporally consistent videos faithful to the input sequence. (c) Prompt-based Relighting: By altering textual descriptions, our model directly manipulates scene illumination to achieve controllable lighting variations. (d) Material Editing: Adjusting the decomposed albedo modality allows re-rendering with corresponding material property changes. (e) Object Insertion: By editing the decomposed albedo and normal modalities, our framework enables object insertion.
  • Figure 2: Impact of different modality combinations on video generation. Visualization of CtrlVDiff multimodal generation results. (a) Using only depth fails to control facial details and text regions described in the prompt. (b) Combining depth and canny enables control over facial features ($\rightarrow$) and partial text regions ($\rightarrow$). (c) Adding albedo further refines color and texture control, especially for the background mural ($\rightarrow$).
  • Figure 3: Framework overview of CtrlVDiff. Given a video with eight paired modalities, we first encode all modalities into latent representations using a pretrained shared 3D-VAE encoder. For each sample within a batch, its latent features are concatenated along the channel dimension. Subsequently, we apply the HMCS to each batch (as illustrated in the box on the right), which enables robust handling of all possible modality combinations. The outputs of the Diffusion Transformer are then processed through modality-specific projection layers, where each modality is assigned an independent projection head to encourage effective modality disentanglement.
  • Figure 4: Qualitative comparison of video depth and segmentation estimation. (a) Video Depth Estimation: VDA-S denotes the Video Depth Anything expert model with a ViT-Small backbone. The arrows ($\rightarrow$) highlight that CtrlVDiff consistently predicts accurate depth for fine structures such as thin wires. (b) Video Segmentation Estimation: One set of arrows ($\rightarrow$) indicates regions that are incorrectly segmented into multiple classes due to object occlusion, while another ($\rightarrow$) marks ambiguous regions where the segmentation granularity is inconsistent. CtrlVDiff achieves the best performance across both tasks.
  • Figure 5: Qualitative comparison of video normal estimation. NormalCrafter is denoted as NC, and DiffusionRenderer as DR. Both CtrlVDiff and DR (Cosmos) demonstrate superior performance in preserving fine details and surface consistency ($\rightarrow$).
  • ...and 13 more figures
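
The Figure 3 caption describes per-modality latents being concatenated along the channel dimension, with some modalities acting as conditions and others as generation targets. The sketch below illustrates that idea in plain Python and is not the authors' implementation. The function names, the uniform random split, the zero placeholder for target slots, and the choice of RGB as the eighth modality are all assumptions made for illustration; the text names only seven modalities explicitly.

```python
import random

# Seven of these are named in the text; "rgb" as the eighth is an assumption.
MODALITIES = ["rgb", "depth", "normal", "segmentation",
              "canny", "albedo", "roughness", "metallic"]


def sample_split(rng, modalities=MODALITIES):
    """Hypothetical routing step: pick a non-empty proper subset of
    modalities as conditions; the remaining ones become targets."""
    k = rng.randint(1, len(modalities) - 1)
    conditions = set(rng.sample(modalities, k))
    return {m: (m in conditions) for m in modalities}


def concat_channels(latents, is_condition, channels_per_modality):
    """Concatenate per-modality latent channels in a fixed order.

    Condition modalities contribute their latent values. Target slots are
    filled with zeros here as a stand-in; a real diffusion model would
    place noised latents there instead.
    """
    out = []
    for m in MODALITIES:
        if is_condition[m]:
            out.extend(latents[m])
        else:
            out.extend([0.0] * channels_per_modality)
    return out


rng = random.Random(0)
split = sample_split(rng)
# Toy latents: 4 channels per modality, each a single scalar for brevity.
toy_latents = {m: [1.0] * 4 for m in MODALITIES}
packed = concat_channels(toy_latents, split, channels_per_modality=4)
```

Because the channel layout is fixed regardless of which modalities are present, a single network can be trained on every condition/target combination, which is the property the caption attributes to HMCS.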