Enabling Versatile Controls for Video Diffusion Models

Xu Zhang; Hao Zhou; Haoming Qin; Xiaobin Lu; Jiaxing Yan; Guanzhong Wang; Zeyu Chen; Yi Liu

TL;DR

We address the challenge of fine-grained spatiotemporal control in video diffusion by proposing VCtrl, a unified conditioning framework that plugs into pretrained video diffusion models without modifying the base generator. VCtrl introduces a Control Encoder that unifies diverse control signals (e.g., Canny edges, masks, pose keypoints) and a sparse residual VCtrl module that injects guidance efficiently, supported by a data-filtering pipeline that improves semantic alignment. Experiments and human evaluations show improved control fidelity and video quality on the Canny, Mask, and Pose conditioning tasks, outperforming task-specific baselines. The approach is lightweight, modular, and compatible with various base architectures, which makes it broadly applicable to controllable video synthesis research and applications.

Abstract

Despite substantial progress in text-to-video generation, precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address this challenge, we introduce VCtrl (also termed PP-VCtrl), a novel framework that enables fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals (such as Canny edges, segmentation masks, and human keypoints) into pre-trained video diffusion models via a generalizable conditional module that uniformly encodes multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to incorporate control representations efficiently. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality. The source code and pre-trained models, implemented in the PaddlePaddle framework, are publicly available at http://github.com/PaddlePaddle/PaddleMIX/tree/develop/ppdiffusers/examples/ppvctrl.
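To make the idea of sparse residual injection concrete, the following is a minimal toy sketch, not the VCtrl implementation. The abstract does not specify how the residual connections are wired, so this sketch assumes that encoded control features are added to the hidden state after every `inject_every`-th frozen base block. All names here (`make_scale_block`, `encode_control`, `run_with_sparse_residuals`, `inject_every`, `gain`) are hypothetical, and plain Python lists stand in for feature tensors.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_scale_block(scale: float) -> Block:
    """Hypothetical stand-in for one frozen block of the base generator."""
    def block(hidden: Vector) -> Vector:
        return [scale * x for x in hidden]
    return block


def encode_control(signal: Sequence[float], gain: float) -> Vector:
    """Toy control encoder (assumption): maps a control signal such as an
    edge map, mask, or keypoint map to features of the hidden-state size."""
    return [gain * s for s in signal]


def run_with_sparse_residuals(
    hidden: Sequence[float],
    base_blocks: Sequence[Block],
    control_features: Sequence[float],
    inject_every: int,
) -> Vector:
    """Run the frozen blocks, adding control features as a residual only
    after every `inject_every`-th block (the assumed 'sparse' pattern)."""
    h = list(hidden)
    for index, block in enumerate(base_blocks):
        h = block(h)  # base block is called as-is; it is never modified
        if index % inject_every == 0:
            h = [x + c for x, c in zip(h, control_features)]
    return h


# Usage: four identity-like base blocks, control injected after blocks 0 and 2.
blocks = [make_scale_block(1.0) for _ in range(4)]
features = encode_control([1.0, 2.0], gain=0.5)
output = run_with_sparse_residuals([0.0, 0.0], blocks, features, inject_every=2)
```

In this sketch, setting `gain=0.0` reproduces the output of the base blocks alone, which mirrors the abstract's claim that control is added without modifying the underlying generator.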
