Local Conditional Controlling for Text-to-Image Diffusion Models

Yibo Zhao, Liang Peng, Yang Yang, Zekai Luo, Hengjia Li, Yao Chen, Zheng Yang, Xiaofei He, Wei Zhao, Qinglin Lu, Boxi Wu, Wei Liu

TL;DR

This paper proposes a Regional Discriminate Loss that updates the noised latents to enhance object generation in non-control regions, and adopts a Feature Mask Constraint to reduce the image quality degradation caused by information differences across the local control region.

Abstract

Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process operates globally on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling a specific local region according to user-defined image conditions, while the remaining regions are conditioned only by the original text prompt. However, achieving local conditional control is non-trivial. The naive approach of directly adding local conditions may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose a Regional Discriminate Loss that updates the noised latents to enhance object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores that lack the strongest response, enhancing object distinction and reducing duplication. Lastly, we adopt a Feature Mask Constraint to reduce the image quality degradation caused by information differences across the local control region. All proposed strategies operate at the inference stage. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions.
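
To make the Focused Token Response idea concrete, here is a minimal Python sketch under stated assumptions. The abstract only says that weaker attention scores lacking the strongest response are suppressed; the function name `focused_token_response`, the list-of-lists layout, and the choice to overwrite suppressed scores with zero are illustrative assumptions, not the authors' implementation.

```python
def focused_token_response(attn, suppress_to=0.0):
    """Keep, at each spatial position, only the token with the strongest response.

    attn: list of rows, one row per spatial position; each row holds that
          position's cross-attention scores for the object tokens.
    suppress_to: value written in place of the weaker scores (an assumption;
                 the source does not state how suppression is carried out).
    """
    refined = []
    for row in attn:
        # Index of the token with the strongest response at this position
        # (ties resolve to the first such token).
        strongest = max(range(len(row)), key=lambda j: row[j])
        refined.append([score if j == strongest else suppress_to
                        for j, score in enumerate(row)])
    return refined


# Hypothetical scores for 3 positions and 2 object tokens ("cat", "dog").
attn = [[0.7, 0.2],
        [0.4, 0.5],
        [0.1, 0.1]]
refined = focused_token_response(attn)
```

After this step each position responds to at most one object token, which is one plausible reading of how the method enhances object distinction and reduces duplication.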

Paper Structure (13 sections, 10 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: The previous global control mechanism mainly synthesizes images similar to the structure conditions, but has difficulty generating results aligned with text prompts. Even adding a control mask produces only the concepts closest to the local condition. Therefore, we explore local control, which takes text prompts, image conditions, and user-defined regions as inputs. Our proposed method successfully generates images that are faithful to both the prompts and the local control conditions.
  • Figure 2: In the ControlNet pipeline, local conditions dominate the image generation. The attention map for "dog" exhibits a higher response in local areas and a lower response in non-local regions where the dog should be generated. Meanwhile, we showcase the result of our method applied to local control, where both "cat" and "dog" are successfully generated.
  • Figure 3: Overview of our method. Given the text condition and local control condition, a latent variable $z_T$ is passed into the denoising network, i.e., UNet. At each denoising step, we apply Feature Mask Constraint (see Section \ref{sec:Feature}) to the ControlNet branch output features. Cross-attention maps generated from UNet are refined by Focused Token Response (see Section \ref{sec:Focused}), which suppresses weaker interfering responses to enhance object distinction. Furthermore, we use Regional Discriminate Loss (see Section \ref{sec:Attention}) to update the latent $z_t$, thereby identifying and regenerating ignored objects in the cross-attention map.
  • Figure 4: Visual results of our method under the local control setting. The proposed method can be employed for flexibly controlling the local regions with image conditions.
  • Figure 5: Comparisons with several baselines under varying conditions. Our method, along with T2I-Adapter and ControlNet, uses local conditions as input. Conversely, Noise-Mask, Feature-Mask, and Inpainting use global conditions as inputs. All these methods use the same control mask. Compared with these baselines, our method can synthesize high-quality images in local control, especially in terms of text alignment.
  • ...and 3 more figures
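
The Feature Mask Constraint step in the Figure 3 caption (a mask applied to the ControlNet branch output features) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single-channel 2D feature map stored as nested lists, a binary mask already resized to the feature resolution, and the usual ControlNet convention of adding the branch output to the UNet features as a residual. The names `apply_feature_mask` and `add_control_residual` are hypothetical.

```python
def apply_feature_mask(control_feat, mask):
    """Zero the ControlNet branch features outside the local control region.

    control_feat: 2D list of floats (one channel of the branch output).
    mask: 2D list of 0/1 values with the same shape; 1 marks the control
          region (assumed to be resized to the feature resolution already).
    """
    return [[f * m for f, m in zip(feat_row, mask_row)]
            for feat_row, mask_row in zip(control_feat, mask)]


def add_control_residual(unet_feat, masked_control_feat):
    """Add the masked branch output to the UNet features element-wise."""
    return [[u + c for u, c in zip(u_row, c_row)]
            for u_row, c_row in zip(unet_feat, masked_control_feat)]


# Hypothetical 2x2 feature maps; the control region is the left column.
control_feat = [[1.0, 2.0],
                [3.0, 4.0]]
mask = [[1, 0],
        [1, 0]]
unet_feat = [[0.5, 0.5],
             [0.5, 0.5]]

masked = apply_feature_mask(control_feat, mask)
fused = add_control_residual(unet_feat, masked)
```

With the mask in place, positions outside the control region receive no contribution from the ControlNet branch and are driven by the UNet features alone.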