EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

Cong Wang, Jiaxi Gu, Panwen Hu, Haoyu Zhao, Yuanfan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

TL;DR

Text-guided video generation struggles to capture motion-rich content, necessitating flexible control signals. EasyControl introduces a Condition Adapter that converts a single condition map into multi-layer conditioning features injected into pre-trained T2V diffusion models via residual connections, enabling multi-modal control with low training cost. The framework supports diverse modalities (image, depth, edges, sketch, segmentation) and demonstrates superior quality and controllability across UCF101 and MSR-VTT, outperforming state-of-the-art baselines in FVD, IS, and alignment metrics. By reusing pre-trained models and employing latent-aware conditioning, EasyControl lowers barriers to practical, controllable video synthesis and interpolation.

Abstract

Following the advancements in text-guided image generation technology exemplified by Stable Diffusion, video generation is gaining increased attention in the academic community. However, relying solely on text guidance for video generation has serious limitations, as videos contain much richer content than images, especially in terms of motion. This information can hardly be adequately described with plain text. Fortunately, in computer vision, various visual representations can serve as additional control signals to guide generation. With the help of these signals, video generation can be controlled in finer detail, allowing for greater flexibility across different applications. Integrating various controls, however, is nontrivial. In this paper, we propose a universal framework called EasyControl. By propagating and injecting condition features through condition adapters, our method enables users to control video generation with a single condition map. With our framework, various conditions including raw pixels, depth, HED, etc., can be integrated into different U-Net-based pre-trained video diffusion models at a low practical cost. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. EasyControl significantly improves various evaluation metrics across multiple validation datasets compared to previous works. Specifically, for the sketch-to-video generation task, EasyControl improves FVD by 152.0 and IS by 19.9 on UCF101 compared with VideoComposer. For fidelity, our model demonstrates powerful image retention ability, resulting in strong FVD and IS scores on UCF101 and MSR-VTT compared to other image-to-video models.

Paper Structure (16 sections, 4 equations, 5 figures, 6 tables)

This paper contains 16 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Architecture comparison of a multi-condition model, VideoComposer [wang2023videocomposer], and our framework, EasyControl. VideoComposer takes temporally dense conditions as input and injects them by concatenation. EasyControl uses only a single frame of condition and injects the condition embeddings through residual summation, which makes the framework more flexible to combine with different pre-trained T2V models.
  • Figure 2: EasyControl can generate user-defined videos from any input condition. Any U-Net-based text-to-video model can incorporate various types of input conditions through the condition adapter, such as Canny edges, sketches, images, segmentation masks, and more. Users only need to provide one condition and a text prompt, and EasyControl takes care of the rest. The input condition is on the left, and frames 1, 4, 5, and 8 of the generated videos are on the right.
  • Figure 3: The EasyControl architecture includes the condition adapter module, in which a feature extractor block processes a single condition map and extracts the relevant condition features. These features are then extended along the temporal dimension through broadcasting and addition, with noise incorporated as necessary. Condition information is integrated into the generation process by adding the multi-layer condition latents produced by the condition adapter to the latent representations of the U-Net.
  • Figure 4: Comparison of VideoComposer, EasyControl (VidRD), and EasyControl (ModelScope) on image-to-video and sketch-to-video generation. ACtrl., Msp., and Vid.Composer denote EasyControl, ModelScope, and VideoComposer, respectively.
  • Figure 5: Image and sketch conditions are used for the ablation studies, shown in the upper and lower parts of the figure, respectively. For each part, three sets of experiments are conducted using text only, condition only, and text with condition. Frames 1, 4, 5, and 8 of the generated videos are shown.
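
The injection scheme described in the Figure 1 and Figure 3 captions (one condition map, multi-layer features, temporal broadcast, residual summation) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the strided subsampling, and the random 1x1 projection that stands in for a learned feature extractor are all assumptions made for illustration, and the optional noise step mentioned in the Figure 3 caption is omitted.

```python
import numpy as np

def extract_condition_features(cond_map, channels_per_layer, rng):
    """Hypothetical stand-in for the feature extractor block.

    cond_map: (C_in, H, W) single condition map (e.g. an image or sketch).
    Returns one feature map per U-Net layer, with layer i at 1/2**i resolution.
    """
    feats = []
    for i, out_channels in enumerate(channels_per_layer):
        stride = 2 ** i
        pooled = cond_map[:, ::stride, ::stride]            # crude downsampling
        # Random 1x1 projection; a real adapter would use learned weights.
        proj = 0.1 * rng.standard_normal((out_channels, cond_map.shape[0]))
        feats.append(np.einsum("oc,chw->ohw", proj, pooled))
    return feats

def inject(unet_latents, cond_feats):
    """Broadcast each single-frame feature over time and add it residually."""
    conditioned = []
    for z, f in zip(unet_latents, cond_feats):
        # z: (T, C, H, W); f[None]: (1, C, H, W) broadcasts across the T frames.
        conditioned.append(z + f[None])
    return conditioned

# Toy example: 8 frames, two U-Net layers with 4 and 8 channels.
rng = np.random.default_rng(0)
num_frames, channels = 8, [4, 8]
cond_map = rng.standard_normal((3, 32, 32))
latents = [
    rng.standard_normal((num_frames, c, 32 // 2 ** i, 32 // 2 ** i))
    for i, c in enumerate(channels)
]
feats = extract_condition_features(cond_map, channels, rng)
conditioned = inject(latents, feats)
```

Because the condition enters only as an additive residual with the same shape as each layer's latents, the backbone's input channels are left unchanged, which is the property the Figure 1 caption contrasts with concatenation-based injection.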