
OmniControlNet: Dual-stage Integration for Conditional Image Generation

Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu

TL;DR

OmniControlNet tackles the redundancy of ControlNet by integrating condition generation and conditioned diffusion into a unified dual-stage pipeline. Stage 1 delivers a multi-task dense image predictor that handles depth, edges, scribbles, and animal poses within a single model, while Stage 2 uses textual inversion-guided prompts to drive a single conditioned diffusion path across all conditioning types. The approach achieves substantially lower model complexity and data needs with competitive image quality compared to existing integrated methods, and includes thorough ablations that highlight the benefits of multi-head, one-hot task encoding and task-prefix strategies. This work offers a practical pathway toward a compact, flexible conditioning framework suitable for broad real-world diffusion-based generation tasks.

Abstract

We provide a two-way integration for the widely adopted ControlNet: we fold its external condition generation algorithms into a single dense prediction method, and we merge its individually trained image generation processes into a single model. Despite its tremendous success, the two-stage ControlNet pipeline is limited in that it is not self-contained (e.g., it calls external condition generation algorithms) and carries large model redundancy (separately trained models for different types of conditioning inputs). Our proposed OmniControlNet consolidates 1) condition generation (e.g., HED edges, depth maps, user scribbles, and animal poses) into a single multi-task dense prediction algorithm under task embedding guidance and 2) the image generation process for different conditioning types under textual embedding guidance. OmniControlNet achieves significantly reduced model complexity and redundancy while remaining capable of producing images of comparable quality for conditioned text-to-image generation.

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Given an input image, our single, integrated OmniControlNet extracts its control features and generates high-quality images. From the first row to the last, the feature visualizations in the middle show Depth, HED, Scribble, and Animal Pose, respectively.
  • Figure 2: Our OmniControlNet model. Whereas ControlNet has to handle each feature separately, from condition generation through image synthesis, our model handles all tasks within one integrated pipeline.
  • Figure 3: The original ControlNet model [zhang2023adding]. Each feature requires its own expert model for condition generation, and a separate ControlNet has to be trained for each feature.
  • Figure 4: An overview of our multi-task dense image prediction pipeline. First, we leverage a Swin Transformer to extract multi-scale features and propose a multi-head FPN to get full-resolution feature maps. Finally, we utilize task-specific embeddings to decode dense predictions from the feature maps.
  • Figure 5: An overview of our conditioned text-to-image generation pipeline. Starting from the original ControlNet structure [zhang2023adding], we use textual inversion to learn task embeddings. We then add the prefix "use <feature> as feature" to the prompt and feed the result into the trainable copy. The left side of the figure gives an overview of the conditioned text-to-image generation model, while the right side illustrates how the CLIP embedding for the new "word" is learned with textual inversion [gal2022image].
  • ...and 1 more figures
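
The task prefix described in the Figure 5 caption can be illustrated with a short sketch. This is only an assumption-laden illustration, not the authors' code: the token names in `TASK_TOKENS`, the helper `build_conditioned_prompt`, and the comma used to join the prefix to the user prompt are all hypothetical. The only part taken from the caption is the prefix pattern "use <feature> as feature", where `<feature>` stands for the placeholder "word" whose embedding is learned with textual inversion.

```python
# Hypothetical placeholder tokens, one per conditioning type named in the
# abstract. The actual token strings used by the paper are not given here.
TASK_TOKENS = {
    "depth": "<depth>",
    "hed": "<hed>",
    "scribble": "<scribble>",
    "animal_pose": "<animal_pose>",
}


def build_conditioned_prompt(task: str, prompt: str) -> str:
    """Add the task prefix 'use <feature> as feature' to a text prompt.

    The comma separator between prefix and prompt is an assumption made
    for readability; the caption only specifies the prefix pattern.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown conditioning task: {task!r}")
    prefix = f"use {TASK_TOKENS[task]} as feature"
    return f"{prefix}, {prompt.strip()}"


if __name__ == "__main__":
    # One prompt, routed to different conditioning types by the prefix alone.
    for task in TASK_TOKENS:
        print(build_conditioned_prompt(task, "a photo of a cat on a sofa"))
```

Under this reading, the same diffusion model receives a different prefix for each conditioning type, so the choice of task is carried by the prompt and no separate model is selected per condition.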