Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Shihao Zhao; Dongdong Chen; Yen-Chun Chen; Jianmin Bao; Shaozhe Hao; Lu Yuan; Kwan-Yee K. Wong

TL;DR

Uni-ControlNet addresses the need for flexible, fine-grained control over text-to-image diffusion outputs by introducing two lightweight adapters for local and global controls that operate within a frozen base model. It employs a shared local encoder with multi-scale condition injection and a global encoder that provides additional tokens to cross-attention, enabling composable conditioning with only two adapters regardless of the number of controls. Through separate training of local/global adapters and a simple inference-time fusion, the approach achieves strong controllability and generation fidelity while maintaining practical fine-tuning costs. Experiments on 10M LAION examples and comparisons with existing controllable diffusion methods demonstrate improved FID/CLIP metrics and robust composability across diverse local/global signals.
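The global adapter's "additional tokens to cross-attention" idea can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the function name `extend_prompt_tokens`, the single linear projection `proj_weight`, the token count of 4, the 77 x 768 text-token shape, and the scalar `weight` are all assumptions made for the example.

```python
import numpy as np


def extend_prompt_tokens(text_tokens, global_embedding, proj_weight,
                         num_global_tokens, weight=1.0):
    """Append projected global-condition tokens to the text tokens.

    text_tokens:      (T, D) text-encoder output used as cross-attention context
    global_embedding: (E,)   e.g. a CLIP image embedding (assumed shape)
    proj_weight:      (E, K * D) hypothetical learned projection
    """
    token_dim = text_tokens.shape[1]
    # Project the single global embedding into K token-sized vectors.
    projected = global_embedding @ proj_weight            # (K * D,)
    global_tokens = projected.reshape(num_global_tokens, token_dim)
    # The frozen model's cross-attention then attends over T + K tokens.
    return np.concatenate([text_tokens, weight * global_tokens], axis=0)


rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(77, 768))
global_embedding = rng.normal(size=(768,))
proj_weight = rng.normal(size=(768, 4 * 768)) * 0.02

extended = extend_prompt_tokens(text_tokens, global_embedding, proj_weight,
                                num_global_tokens=4)
print(extended.shape)  # (81, 768)
```

Because the extra tokens are simply appended to the context, the base model's attention layers need no structural change in this sketch; setting `weight` to 0 zeroes out the appended tokens.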

Abstract

Text-to-image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images from open-domain text descriptions. Despite this success, text descriptions often struggle to convey detailed controls adequately, even when they are long and complex. Moreover, recent studies have shown that these models have difficulty understanding such complex texts and generating the corresponding images. There is therefore a growing need for control modes beyond text description. In this paper, we introduce Uni-ControlNet, a unified framework that allows the simultaneous use of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within a single model. Unlike existing methods, Uni-ControlNet only requires fine-tuning two additional adapters on top of frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to dedicated adapter designs, Uni-ControlNet needs only a constant number of adapters (i.e., 2), regardless of the number of local or global controls used. This not only reduces fine-tuning cost and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.
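One way a single local adapter can serve any number of local controls is to give each condition a fixed channel slot and zero-fill the slots that are not supplied. The sketch below illustrates that idea under stated assumptions: the slot ordering in `LOCAL_CONDITIONS`, the three channels per condition, and the helper name `stack_local_conditions` are hypothetical and are not taken from the released code.

```python
import numpy as np

# Hypothetical fixed ordering of local-condition slots.
LOCAL_CONDITIONS = ("edge", "depth", "segmentation")


def stack_local_conditions(conditions, height, width, channels_per_condition=3):
    """Concatenate local condition maps along the channel axis.

    conditions: dict mapping a condition name to a (C, H, W) array.
    Missing conditions are replaced by zeros, so the adapter always
    receives an input with the same number of channels.
    """
    slots = []
    for name in LOCAL_CONDITIONS:
        cond = conditions.get(name)
        if cond is None:
            cond = np.zeros((channels_per_condition, height, width),
                            dtype=np.float32)
        slots.append(np.asarray(cond, dtype=np.float32))
    return np.concatenate(slots, axis=0)


# Only a depth map is supplied; the edge and segmentation slots stay zero.
depth = np.ones((3, 64, 64), dtype=np.float32)
stacked = stack_local_conditions({"depth": depth}, height=64, width=64)
print(stacked.shape)  # (9, 64, 64)
```

With this layout, combining conditions at inference time amounts to filling more than one slot, which is what keeps the adapter count constant in this sketch.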

Paper Structure (25 sections, 6 equations, 21 figures, 8 tables)

This paper contains 25 sections, 6 equations, 21 figures, 8 tables.

Introduction
Related Work
Text-to-Image Generation
Controllable Diffusion Models
Method
Preliminary
Control Adapter
Training Strategy
Experiments
Implementation Details
Controllable Generation Results
Comparison with Existing Methods
Ablation Analysis
Conclusion and Social Impact
The Weight of the Global Condition
...and 10 more sections

Figures (21)

Figure 1: Visual results of our proposed Uni-ControlNet. The top two rows show results for a single condition and the bottom two rows show results for multiple conditions.
Figure 2: The overall framework of our proposed Uni-ControlNet.
Figure 3: Details of the local and global control adapters.
Figure 4: More visual results of Uni-ControlNet. The top two rows show results for a single condition, with columns 1-7 for local conditions and columns 8-9 for the global condition. The third row shows the results of combining two local conditions, while the fourth row shows the results of integrating a local condition with a global condition. There is no text prompt for the examples in the fourth row.
Figure 5: Comparison of existing controllable diffusion models on different single conditions.
...and 16 more figures
