FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting
Teng-Fang Hsiao, Bo-Kai Ruan, Sung-Lin Tsai, Yi-Lun Wu, Hong-Han Shuai
TL;DR
This paper analyzes why text-guided inpainting with Stable Diffusion Inpainting (SDI) often fails to follow the prompt when the image context dominates the result. It shows that cross-attention channels adapt to the mask input, and that reducing the influence of the image context while strengthening mask-driven cues can improve instruction-following without extra computation. The authors propose FreeCond, a training-free plug-in that modifies only the image condition $z^c$ (via a low-pass filter) and the mask condition $M^c$ (via scaling), yielding the noise prediction $\hat{\epsilon}_\theta(z_t,z^{fc},M^{fc},t,p)$ and improving prompt adherence and mask-fitting across SDI-based models. They also introduce FCIBench, a benchmark with precise masks, rough masks, and multi-mask inputs plus complex prompts, designed to evaluate inpainting rigorously under diverse conditions. They report substantial CLIP-based gains and robust improvements across baselines, with only minor trade-offs in fine-grained detail.
Abstract
In this study, we aim to identify and resolve the deficiency of Stable Diffusion Inpainting (SDI) in following the instructions of both the prompt and the mask. Due to the training bias from masking, inpainting quality degrades when the prompt instruction and the image condition are unrelated. We therefore conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layer. We observe that adapting text key tokens toward the input mask enables the model to selectively paint within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and image condition. By increasing the latent mask value and modifying the frequency content of the image condition, we align the cross-attention features with the model's training bias to improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, e.g., yielding improvements of up to 60% and 58% in CLIP score for SDI and SDXLI, respectively.
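The two input adjustments described above (scaling the latent mask and low-pass filtering the image condition) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `low_pass_filter` and `freecond_inputs`, the square frequency cutoff, and the default values of `keep_ratio` and `mask_scale` are all assumptions chosen for illustration, and the latents are assumed to be arrays of shape (C, H, W).

```python
import numpy as np


def low_pass_filter(z_c: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep only the low spatial frequencies of a (C, H, W) latent.

    keep_ratio is a hypothetical knob: the fraction of the Nyquist
    frequency (0.5 cycles/sample) retained along each spatial axis.
    """
    _, h, w = z_c.shape
    # Per-axis frequencies in cycles per sample, in [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Square pass-band centred on the DC component (an assumed filter shape).
    keep = (np.abs(fy) <= 0.5 * keep_ratio) & (np.abs(fx) <= 0.5 * keep_ratio)
    spectrum = np.fft.fft2(z_c, axes=(-2, -1))
    # The pass-band is symmetric in +/- frequency, so the inverse is real
    # up to floating-point error; np.real drops the residual imaginary part.
    return np.real(np.fft.ifft2(spectrum * keep, axes=(-2, -1)))


def freecond_inputs(z_c, mask, mask_scale=2.0, keep_ratio=0.5):
    """Return adjusted (image condition, mask condition) for the denoiser.

    mask_scale > 1 raises the latent mask value; both defaults are
    placeholders, not values taken from the paper.
    """
    z_fc = low_pass_filter(z_c, keep_ratio)
    m_fc = mask_scale * mask
    return z_fc, m_fc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_c = rng.standard_normal((4, 8, 8))   # stand-in image-condition latent
    mask = np.zeros((1, 8, 8))
    mask[:, 2:6, 2:6] = 1.0                # stand-in latent mask
    z_fc, m_fc = freecond_inputs(z_c, mask)
    print(z_fc.shape, m_fc.max())
```

In an actual pipeline, the adjusted pair would replace the original image and mask conditions at the denoiser input, leaving the model weights and sampling loop unchanged, which is consistent with the abstract's claim of no additional computation.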
