
SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

Xiaoyu Liu, Yuxiang Wei, Ming Liu, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo

TL;DR

SmartControl tackles the misalignment between rough visual conditions and text prompts in controllable text-to-image (T2I) generation by learning a local adaptive control scale map $\boldsymbol{\alpha}$ through a Control Scale Predictor. The approach leverages priors from ControlNet and trains on an unaligned dataset of text–condition pairs to explicitly identify and soften conflicts between the prompt $\mathbf{p}$ and the rough condition $\mathbf{c}_\mathit{rough}$, enabling region-wise control without manual tuning. Key contributions include the design of a pixel-wise predictor per decoder block, a pipeline for generating and labeling unaligned data, and an objective combining diffusion loss with regularization to enforce sensible background and conflict-region behavior. Empirically, SmartControl improves text–image alignment and preserves rough visual cues across multiple condition types and backbones, with strong generalization to unseen objects and practical applicability demonstrated through extensive ablations and qualitative evaluations; code and data are released for public use.
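
The role of the control scale map can be written compactly. The following is a sketch based on the Figure 3 caption, not a reproduction of the paper's numbered equations: the block index $i$ and the symbol $\mathbf{h}'_i$ are notation introduced here, and it is assumed that the scaled condition feature is added back to the decoder feature $\mathbf{h}_i$ as in standard ControlNet.

$$
\boldsymbol{\alpha}_i = \mathit{f}_i\left(\mathbf{h}_i,\ \mathbf{h}_i + \mathbf{h}_{\mathit{cond},i}\right), \qquad
\mathbf{h}'_i = \mathbf{h}_i + \boldsymbol{\alpha}_i \cdot \mathbf{h}_{\mathit{cond},i}
$$

Under this reading, the product is taken pixel-wise, $\boldsymbol{\alpha}_i = \mathbf{1}$ everywhere recovers the unscaled condition feature, and smaller values of $\boldsymbol{\alpha}_i$ in a region relax the control there.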

Abstract

Human visual imagination usually begins with analogies or rough sketches. For example, given an image of a girl playing guitar in front of a building, one may imagine by analogy how it would look if Iron Man were playing guitar in front of a pyramid in Egypt. However, the visual condition may not be precisely aligned with the imagined result indicated by the text prompt, and existing layout-controllable text-to-image (T2I) generation models are prone to producing degraded results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify rough visual conditions so that they adapt to the text prompt. The key idea of SmartControl is to relax the visual condition in areas that conflict with the text prompt. Specifically, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict local control scales, and a dataset of text prompts and rough visual conditions is constructed for training the CSP. Notably, even with a limited number (e.g., 1,000 to 2,000) of training samples, SmartControl generalizes well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of SmartControl against state-of-the-art methods. Source code, pre-trained models, and datasets are available at https://github.com/liuxiaoyu1104/SmartControl.

Paper Structure (17 sections, 6 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Our proposed SmartControl can perform controllable image generation under rough visual conditions extracted from other images. In contrast, ControlNet [zhang2023adding] adheres strictly to the control conditions, which may go against human intentions.
  • Figure 2: Images generated with different control scales. Plausible images are highlighted with red boxes; they occur at different control scales, and for some cases no suitable control scale can be found at all.
  • Figure 3: Framework of the proposed SmartControl. Our method is built upon ControlNet and can generate photo-realistic images from an inconsistent prompt and rough visual condition (i.e., tiger vs. deer). To achieve this, we introduce a control scale predictor $\mathit{f}$ for each decoder block of ControlNet. The predictor takes $\mathbf{h}$ and $\mathbf{h}+ \mathbf{h}_\mathit{cond}$ as input and predicts a pixel-wise control scale map $\boldsymbol{\alpha}$. The condition feature $\mathbf{h}_\mathit{cond}$ is then updated to $\boldsymbol{\alpha}\cdot\mathbf{h}_\mathit{cond}$ to relax the control scale in conflict regions, resulting in a plausible and photo-realistic generated image.
  • Figure 4: Pipeline for unaligned data construction. Given an image and its class, we extract the visual condition $\mathbf{c}_\mathit{rough}$ (e.g., depth) with a pre-trained estimator. Then, for the given class (e.g., deer), we select an alternative unaligned class (e.g., tiger or horse) based on the class hierarchy and use it to form the unaligned prompt $\mathbf{p}$. By iterating over different control scales $\alpha$ of ControlNet [zhang2023adding], we generate a series of images for $(\mathbf{c}_\mathit{rough}, \mathbf{p})$. We then manually keep only the images that are faithful to both the text and the rough condition to construct our dataset. For example, for tiger, the image generated with $\alpha = 0.4$ is plausible and is added to our dataset, whereas for horse no suitable image exists and all images are discarded.
  • Figure 5: Qualitative comparison with different modalities, image prompts and additional visual conditions. SmartControl achieves reasonable spatial control and superior image-text alignment compared to existing methods, resulting in a closer match to human intentions.
  • ...and 5 more figures
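
The data-construction loop described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `sweep_control_scales`, `fake_generate`, and `fake_review` are hypothetical names, the grid of control scales is assumed (the caption only mentions $\alpha = 0.4$), the generator is a stub standing in for ControlNet, and the `accept` callback stands in for the manual filtering step.

```python
from typing import Callable, List, Sequence, Tuple


def sweep_control_scales(
    c_rough: str,
    prompt: str,
    generate: Callable[[str, str, float], str],
    accept: Callable[[str], bool],
    scales: Sequence[float] = (0.2, 0.4, 0.6, 0.8, 1.0),  # assumed grid, not from the paper
) -> List[Tuple[float, str]]:
    """Generate one image per control scale and keep those the reviewer accepts.

    An empty result means the (c_rough, prompt) pair is discarded.
    """
    kept = []
    for scale in scales:
        image = generate(c_rough, prompt, scale)
        if accept(image):
            kept.append((scale, image))
    return kept


def fake_generate(c_rough: str, prompt: str, scale: float) -> str:
    # Hypothetical stub: a real pipeline would call ControlNet here.
    return f"{prompt}@{scale}"


def fake_review(image: str) -> bool:
    # Hypothetical stand-in for manual filtering: only one image is judged plausible.
    return image == "tiger@0.4"


tiger_kept = sweep_control_scales("deer_depth", "tiger", fake_generate, fake_review)
horse_kept = sweep_control_scales("deer_depth", "horse", fake_generate, fake_review)
print(tiger_kept)  # [(0.4, 'tiger@0.4')]
print(horse_kept)  # []
```

In this toy run the tiger prompt yields one accepted sample at scale 0.4, while the horse prompt yields none and would be dropped, mirroring the example in the caption.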