SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions
Xiaoyu Liu, Yuxiang Wei, Ming Liu, Xianhui Lin, Peiran Ren, Xuansong Xie, Wangmeng Zuo
TL;DR
SmartControl addresses the misalignment between rough visual conditions and text prompts in controllable text-to-image (T2I) generation by learning a local adaptive control scale map $\boldsymbol{\alpha}$ with a Control Scale Predictor. The approach builds on priors from ControlNet and trains on an unaligned dataset of text–condition pairs to explicitly identify and soften conflicts between the text prompt $\mathbf{p}$ and the rough visual condition $\boldsymbol{c}_{rough}$, enabling region-wise control without manual tuning. Key contributions are a pixel-wise predictor for each decoder block, a pipeline for generating and labeling unaligned data, and an objective that combines the diffusion loss with a regularization term to enforce sensible behavior in background and conflict regions. Empirically, SmartControl improves text–image alignment while preserving rough visual cues across multiple condition types and backbones, generalizes well to unseen objects, and is supported by extensive ablations and qualitative evaluations. Code and data are publicly released.
Abstract
Human visual imagination usually begins with analogies or rough sketches. For example, given an image of a girl playing guitar in front of a building, one may imagine by analogy what it would look like if Iron Man were playing guitar in front of a pyramid in Egypt. However, the visual condition may not be precisely aligned with the imagined result described by the text prompt, and existing layout-controllable text-to-image (T2I) generation models are prone to producing degraded results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify rough visual conditions so that they adapt to the text prompt. The key idea of SmartControl is to relax the visual condition in the areas that conflict with the text prompt. Specifically, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict local control scales, and a dataset of text prompts paired with rough visual conditions is constructed for training the CSP. Notably, even with a limited number of training samples (e.g., 1,000 to 2,000), SmartControl generalizes well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of SmartControl against state-of-the-art methods. Source code, pre-trained models, and datasets are available at https://github.com/liuxiaoyu1104/SmartControl.
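To make the key idea concrete, the following minimal NumPy sketch illustrates what relaxing the visual condition with local control scales could look like. It assumes that the condition branch contributes an additive residual to a backbone feature map and that the control scale map holds one value per spatial location; the function name `apply_local_control_scale`, the tensor shapes, and the hand-written conflict region are hypothetical and are not taken from the released code. In SmartControl the map would be predicted by the CSP rather than set by hand.

```python
import numpy as np


def apply_local_control_scale(backbone_feat, control_feat, alpha):
    """Fuse features as backbone + alpha * control, with one alpha per pixel.

    backbone_feat, control_feat: arrays of shape (C, H, W).
    alpha: array of shape (H, W); values are clipped to [0, 1].
    Hypothetical helper for illustration only.
    """
    if backbone_feat.shape != control_feat.shape:
        raise ValueError("backbone_feat and control_feat must share a shape")
    if alpha.shape != backbone_feat.shape[1:]:
        raise ValueError("alpha must have shape (H, W)")
    alpha = np.clip(alpha, 0.0, 1.0)
    # Broadcast the (H, W) map over the channel axis.
    return backbone_feat + alpha[None, :, :] * control_feat


# Toy example: full control everywhere except a hand-picked "conflict" region.
C, H, W = 4, 8, 8
rng = np.random.default_rng(0)
backbone = rng.normal(size=(C, H, W))
control = rng.normal(size=(C, H, W))

alpha = np.ones((H, W))
alpha[2:6, 2:6] = 0.2  # assumed conflict region: relax the visual condition here

fused = apply_local_control_scale(backbone, control, alpha)
```

A single global scale would have to weaken the condition everywhere to resolve a local conflict; a per-pixel map lets the background keep following the rough condition while only the conflicting area is relaxed.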
