Enabling Versatile Controls for Video Diffusion Models

Xu Zhang, Hao Zhou, Haoming Qin, Xiaobin Lu, Jiaxing Yan, Guanzhong Wang, Zeyu Chen, Yi Liu

TL;DR

We address the challenge of fine-grained spatiotemporal control in video diffusion by proposing VCtrl, a unified conditioning framework that plugs into pretrained video diffusion models without modifying the base generator. VCtrl introduces a Control Encoder that unifies diverse control signals (e.g., Canny edges, masks, pose keypoints) and a sparse residual VCtrl module that injects guidance efficiently, aided by a data-filtering pipeline that improves semantic alignment. Extensive experiments and human evaluations demonstrate improved control fidelity and video quality across Canny, Mask, and Pose conditioning tasks, outperforming task-specific baselines. The approach is lightweight, modular, and compatible with various base architectures, making it broadly applicable to controllable video synthesis research and applications.

Abstract

Despite substantial progress in text-to-video generation, achieving precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address this challenge, we introduce VCtrl (also termed PP-VCtrl), a novel framework designed to enable fine-grained control over pretrained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, into pretrained video diffusion models via a generalizable conditional module capable of uniformly encoding multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to efficiently incorporate control representations. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality. The source code and pretrained models, implemented in the PaddlePaddle framework, are publicly available at http://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers/examples/ppvctrl.

Paper Structure

This paper contains 19 sections, 11 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Examples generated by VCtrl (also termed PP-VCtrl) using reference frames and text prompts. VCtrl enables users to guide large pretrained video diffusion models using diverse controls, including Canny edges (top), segmentation masks (middle), and human keypoints (bottom), generating high-quality videos that accurately adhere to the provided control signals.
  • Figure 2: Architecture overview of VCtrl. A control signal (e.g., Canny edges, semantic masks, or pose keypoints) is first encoded by the control encoder. The resulting representation is then additively combined with the latent and incorporated into the Video Diffusion Model via the proposed VCtrl module, which leverages a sparse residual connection mechanism. After several iterative denoising steps, the refined latent is decoded by a pretrained VAE to produce the final video.
  • Figure 3: Our Data Filtering Pipeline. Videos refined by an aesthetic filter are recaptioned and processed to extract Canny edges, human keypoints, and segmentation masks, providing training data for diverse controllable tasks.
  • Figure 4: Qualitative comparison to previous methods. We compare our method with Control-A-Video [chen2023controlavideo] and Text2Video-Zero [text2video-zero], demonstrating superior visual coherence and stronger adherence to the Canny edge conditions.
  • Figure 5: Control Layouts. (a) Even: control signals uniformly injected throughout the network; (b) End: control signals densely injected toward the end of the network; (c) Space: control signals sparsely and evenly distributed across the network.
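
The captions for Figure 2 and Figure 5 describe control features being added to the generator through sparse residual connections, placed according to a layout. The toy sketch below illustrates that idea in plain Python and is not the authors' implementation. The function names (`end_layout`, `space_layout`, `run_with_residuals`), the choice to add each residual to a block's output, and the fixed-stride reading of the Space layout are all assumptions made for illustration. The Even layout is left out because the caption alone does not pin down how it differs from Space.

```python
from typing import Callable, Sequence


def end_layout(num_blocks: int, num_injections: int) -> list[int]:
    """Assumed 'End' layout: the last `num_injections` block indices."""
    if not 0 < num_injections <= num_blocks:
        raise ValueError("num_injections must be in 1..num_blocks")
    return list(range(num_blocks - num_injections, num_blocks))


def space_layout(num_blocks: int, num_injections: int) -> list[int]:
    """Assumed 'Space' layout: indices at a fixed stride across the stack."""
    if not 0 < num_injections <= num_blocks:
        raise ValueError("num_injections must be in 1..num_blocks")
    stride = num_blocks // num_injections
    return [i * stride for i in range(num_injections)]


def run_with_residuals(
    latent: float,
    base_blocks: Sequence[Callable[[float], float]],
    control_feats: Sequence[float],
    layout: Sequence[int],
) -> float:
    """Run the base blocks in order and add a control residual only at the
    block indices listed in `layout`. The base blocks are left untouched.
    Scalars stand in for feature tensors (illustrative only)."""
    if len(control_feats) != len(layout):
        raise ValueError("need one control feature per layout index")
    residual_at = dict(zip(layout, control_feats))
    hidden = latent
    for index, block in enumerate(base_blocks):
        hidden = block(hidden)
        if index in residual_at:
            hidden = hidden + residual_at[index]  # sparse residual connection
    return hidden


if __name__ == "__main__":
    print(end_layout(12, 4))    # indices clustered at the end of the stack
    print(space_layout(12, 4))  # indices spread across the stack
    blocks = [lambda x: x + 1.0] * 4
    print(run_with_residuals(0.0, blocks, [10.0, 20.0], [1, 3]))
```

In a real model the scalars would be feature tensors produced by the control encoder, and the base blocks would be the frozen layers of the pretrained video diffusion model; only the layout indices and the control branch would change between variants.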