ECNet: Effective Controllable Text-to-Image Diffusion Models

Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li

TL;DR

The paper proposes two components: a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs together with precise annotation information, and a Diffusion Consistency Loss (DCL), which applies supervision on the denoised latent code at any given time step, enhancing the robustness and accuracy of the output.

Abstract

Conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised for two main reasons: ambiguous condition input, and inadequate condition guidance from a single denoising loss. To address these challenges, we introduce two solutions. First, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Second, to overcome the issue of limited conditional supervision, we introduce Diffusion Consistency Loss (DCL), which applies supervision on the denoised latent code at any given time step. This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output. The combination of SGI and DCL results in our Effective Controllable Network (ECNet), an end-to-end controllable text-to-image generation framework with more precise conditioning input and stronger controllable supervision. We validate our approach through extensive experiments on generation under various conditions, such as human body skeletons, facial landmarks, and sketches of general objects. The results consistently demonstrate that our method significantly enhances the controllability and robustness of the generated images, outperforming existing state-of-the-art controllable text-to-image models.
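The idea of supervising "the denoised latent code at any given time step" can be illustrated with the standard DDPM identity that recovers an estimate of the clean latent from a noisy latent and a noise prediction, $\hat{z}_0 = (z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta)/\sqrt{\bar{\alpha}_t}$. The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear beta schedule, the function names, and the plain mean-squared error are all illustrative. In the paper, DCL compares keypoint heatmaps of estimated and input images rather than raw latents; the MSE here only stands in for that comparison.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(z0, eps, alpha_bar):
    """Forward process: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def estimate_z0(z_t, eps_pred, alpha_bar):
    """Invert the forward process with a predicted noise to estimate the clean latent."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(z - b * e) / a for z, e in zip(z_t, eps_pred)]


def mse(x, y):
    """Mean-squared error; a hypothetical stand-in for the heatmap-based comparison."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = make_alpha_bars()
    z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]   # toy "clean latent"
    eps = [rng.gauss(0.0, 1.0) for _ in range(16)]  # true noise
    t = 500                                         # any time step works
    z_t = add_noise(z0, eps, alpha_bars[t])

    # An imperfect noise prediction yields an imperfect z0 estimate,
    # which a consistency term can penalize at this time step.
    eps_pred = [e + 0.1 for e in eps]
    z0_hat = estimate_z0(z_t, eps_pred, alpha_bars[t])
    print("consistency term:", mse(z0_hat, z0))
```

Because the estimate is available in closed form at every time step, a consistency term of this kind can be added alongside the usual noise-prediction loss without running the full reverse process.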

Paper Structure (19 sections, 8 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: The core work of this paper is to design a general framework for supervised training of diffusion models and to enhance the controllability of text-to-image diffusion models. The figure shows three categories of conditions: skeleton (rows I and II), facial landmark (row III), and sketch (row IV). Each category includes five sets of images, where each set displays: (a) the original image used for reference; (b) the pose (facial contour or sketch) image derived from the original image as the control condition; (c) results generated by ControlNet; (d) results generated by the baseline model, HumanSD, for comparison; (e) results generated by ECNet (our model). Compared to ControlNet and HumanSD, ECNet exhibits superior capability and robustness in controlled image generation across all categories.
  • Figure 2: The framework and its loss design, illustrated using the skeleton control task as an example. Our model encodes the skeleton image via a VAE to obtain a pose latent code. This code is combined with the diffusion noise code as input to a U-Net. Additionally, our $\textit{SGI}$ module combines the corresponding pose annotations and text and integrates them into the U-Net layers. During training, we enhance the conditional generation capabilities of the diffusion model by introducing $\textit{DCL}$. $\textit{DCL}$ targets heatmap disparities between estimated and input images, using a dual-stage loss to impose consistency supervision throughout the diffusion process. $z$ represents the latent code and $x$ denotes the image decoded from $z$. Please refer to the method section for more details.
  • Figure 3: The decoded images of pose and face. First row: The process of incrementally adding noise to the original image over time steps; second row: the noise difference; third row: denoised results derived from the predicted noise latent code. White areas show the detected keypoint heatmaps.
  • Figure 4: Generated images on the skeleton control task. The comparison of generated results across various scenarios validates the prior adaptability of ECNet.
  • Figure 5: Generated images on the facial landmark control task. The comparison of generated results under landmark control shows that ECNet surpasses prior SD-based models on this task.
  • ...and 2 more figures