CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation

Yifeng Xu, Zhenliang He, Shiguang Shan, Xilin Chen

TL;DR

CtrLoRA addresses the high cost of training an independent ControlNet for every condition by introducing a shared Base ControlNet, trained on multiple base conditions, plus per-condition LoRAs. The Base ControlNet captures general image-to-image (I2I) knowledge, while the LoRAs encode condition-specific traits. This allows rapid adaptation to a new condition with as few as $1{,}000$ samples and less than an hour of single-GPU training, using roughly $37$M trainable parameters per new condition. Using a pretrained VAE as the condition embedding network accelerates convergence and mitigates the sudden-convergence phenomenon, which makes training more stable. The approach supports multi-condition generation by composing LoRAs and can be integrated into community diffusion models, lowering deployment barriers and making controllable image generation easier to scale to new conditions.
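To make the "shared base plus per-condition LoRA" idea concrete, here is a minimal pure-Python sketch of a single linear layer with a frozen base weight and one low-rank pair $(A, B)$ per condition, so that the effective weight is $W + BA$. It is an illustration under stated assumptions, not the authors' implementation: the class name `LoRALinear`, the layer sizes, the rank, and the condition names `"canny"` and `"depth"` are all hypothetical.

```python
import random


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class LoRALinear:
    """One frozen base weight shared by all conditions, plus a low-rank
    (A, B) pair per condition. Hypothetical sketch, not the paper's code."""

    def __init__(self, d_in, d_out, rank, seed=0):
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.rng = random.Random(seed)
        # Stand-in for a pretrained base weight; it stays frozen during adaptation.
        self.W = [[self.rng.gauss(0, 0.02) for _ in range(d_in)]
                  for _ in range(d_out)]
        self.adapters = {}

    def add_condition(self, name):
        # A is small and random (rank x d_in); B is zero (d_out x rank),
        # so a freshly added adapter leaves the base output unchanged.
        A = [[self.rng.gauss(0, 0.02) for _ in range(self.d_in)]
             for _ in range(self.rank)]
        B = [[0.0] * self.rank for _ in range(self.d_out)]
        self.adapters[name] = (A, B)

    def forward(self, x, condition=None):
        y = matvec(self.W, x)
        if condition is not None:
            A, B = self.adapters[condition]
            delta = matvec(B, matvec(A, x))  # low-rank update B(Ax)
            y = [a + b for a, b in zip(y, delta)]
        return y

    def trainable_params(self):
        # Only A and B are trained for a new condition.
        return self.rank * (self.d_in + self.d_out)


layer = LoRALinear(d_in=64, d_out=64, rank=4)
layer.add_condition("canny")  # hypothetical condition names
layer.add_condition("depth")

x = [1.0] * 64
base_out = layer.forward(x)
canny_out = layer.forward(x, "canny")

full_params = 64 * 64
lora_params = layer.trainable_params()
print(lora_params, full_params)  # 512 4096
```

In this toy layer, adding a condition costs 512 trainable values against 4,096 in the frozen weight, and switching conditions only swaps which $(A, B)$ pair is applied. The zero initialization of $B$ is the common LoRA convention and is assumed here rather than taken from the text above.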

Abstract

Recently, large-scale diffusion models have made impressive progress in text-to-image (T2I) generation. To further equip these T2I models with fine-grained spatial control, approaches like ControlNet introduce an extra network that learns to follow a condition image. However, for every single condition type, ControlNet requires independent training on millions of data pairs with hundreds of GPU hours, which is quite expensive and makes it challenging for ordinary users to explore and develop new types of conditions. To address this problem, we propose the CtrLoRA framework, which trains a Base ControlNet to learn the common knowledge of image-to-image generation from multiple base conditions, along with condition-specific LoRAs to capture distinct characteristics of each condition. Utilizing our pretrained Base ControlNet, users can easily adapt it to new conditions, requiring as few as 1,000 data pairs and less than one hour of single-GPU training to obtain satisfactory results in most scenarios. Moreover, our CtrLoRA reduces the learnable parameters by 90% compared to ControlNet, significantly lowering the threshold to distribute and deploy the model weights. Extensive experiments on various types of conditions demonstrate the efficiency and effectiveness of our method. Codes and model weights will be released at https://github.com/xyfJASON/ctrlora.
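The two headline figures (roughly 37M trainable parameters per new condition, and a 90% reduction relative to ControlNet) can be combined in a short back-of-the-envelope calculation. The sketch below is only arithmetic on those rounded numbers: the implied ControlNet size of about 370M is a derived estimate, not a figure stated here, and treating the shared Base ControlNet as the same size as one full ControlNet is an assumption made for illustration.

```python
# Figures quoted in the summary (rounded, in millions of parameters).
LORA_PARAMS_M = 37        # trainable parameters per new condition
REDUCTION_PERCENT = 90    # reduction relative to a full ControlNet

# Implied size of one full ControlNet: 37M is the remaining 10%.
# This is a derived estimate, not a number stated in the text.
implied_controlnet_m = LORA_PARAMS_M * 100 // (100 - REDUCTION_PERCENT)


def total_params_m(num_conditions, shared_base_m=implied_controlnet_m):
    """Return (independent, shared) parameter totals in millions.

    independent: one full ControlNet per condition.
    shared: one base network (ASSUMED to be ControlNet-sized) plus one
            LoRA per condition.
    """
    independent = num_conditions * implied_controlnet_m
    shared = shared_base_m + num_conditions * LORA_PARAMS_M
    return independent, shared


for n in (1, 5, 10):
    independent, shared = total_params_m(n)
    print(f"{n:>2} conditions: independent={independent}M, shared={shared}M")
```

Under these assumptions a single condition is slightly more expensive with the shared design (407M against 370M), but the gap reverses quickly: at ten conditions the totals are 740M against 3,700M, since each extra condition adds only the LoRA weights.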

Paper Structure

This paper contains 37 sections, 5 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Our results of single-conditional generation, multi-conditional generation, style transfer.
  • Figure 2: Overview of the CtrLoRA framework. "CN" denotes Base ControlNet, "L" denotes LoRA. (a) We first train a shared Base ControlNet in conjunction with condition-specific LoRAs on a large-scale dataset that contains multiple base conditions. (b) The trained Base ControlNet can be easily adapted to novel conditions with significantly less data, fewer devices, and shorter time.
  • Figure 3: Training and inference of our CtrLoRA framework. "SD" denotes Stable Diffusion, "CN" denotes Base ControlNet, and "L"s in different colors denote LoRAs for different conditions.
  • Figure 4: Visual comparison on base conditions.
  • Figure 5: Visual comparison on new conditions.
  • ...and 14 more figures