FreeCond: Free Lunch in the Input Conditions of Text-Guided Inpainting

Teng-Fang Hsiao, Bo-Kai Ruan, Sung-Lin Tsai, Yi-Lun Wu, Hong-Han Shuai

TL;DR

This paper analyzes why text-guided inpainting with Stable Diffusion Inpainting (SDI) often fails to follow the prompt when the image context dominates the result. It finds that cross-attention channels adapt to the mask input, and that reducing the influence of image context while strengthening mask-driven cues improves instruction following without extra computation. The authors propose FreeCond, a training-free plug-in that modifies only the image condition $z^c$ (via a low-pass filter) and the mask condition $M^c$ (via scaling), so that the noise prediction becomes $\hat{\epsilon}_\theta(z_t,z^{fc},M^{fc},t,p)$; this improves prompt adherence and mask fitting across SDI-based models. They also introduce FCIBench, a benchmark with precise, rough, and multiple masks plus complex prompts for evaluating inpainting under diverse conditions, and report substantial CLIP-score gains and consistent improvements across baselines, with only minor trade-offs in fine-grained detail.
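The two input modifications described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the circular FFT cutoff, and the default values of `cutoff_ratio` and `mask_scale` are assumptions made for the example, since this summary does not specify the exact filter or hyperparameters.

```python
import numpy as np


def low_pass_filter(z_c, cutoff_ratio=0.5):
    """Keep only low spatial frequencies of a (C, H, W) latent.

    Hypothetical helper: a hard circular cutoff in the centered 2D spectrum.
    """
    _, h, w = z_c.shape
    # Move the zero-frequency component to the center of each channel's spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(z_c, axes=(-2, -1)), axes=(-2, -1))
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    radius = cutoff_ratio * min(h, w) / 2
    keep = ((yy - cy) ** 2 + (xx - cx) ** 2) <= radius ** 2
    filtered = spectrum * keep  # broadcasts the (H, W) mask over channels
    restored = np.fft.ifft2(np.fft.ifftshift(filtered, axes=(-2, -1)), axes=(-2, -1))
    return restored.real


def freecond_inputs(z_c, m_c, mask_scale=2.0, cutoff_ratio=0.5):
    """Return (z_fc, m_fc): a low-passed image condition and a scaled mask.

    Both default values are placeholders, not values taken from the paper.
    """
    z_fc = low_pass_filter(z_c, cutoff_ratio)
    m_fc = mask_scale * m_c
    return z_fc, m_fc


# Example with toy shapes: a 4-channel 8x8 latent and a binary 1x8x8 mask.
z_c = np.random.default_rng(0).normal(size=(4, 8, 8))
m_c = np.zeros((1, 8, 8))
m_c[:, 2:6, 2:6] = 1.0
z_fc, m_fc = freecond_inputs(z_c, m_c)
```

In an actual pipeline, `z_fc` and `m_fc` would replace $z^c$ and $M^c$ in the inputs passed to the denoiser, leaving the model weights and the sampling loop unchanged, which is why the method adds no training cost.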

Abstract

In this study, we aim to identify and address the deficiency of Stable Diffusion Inpainting (SDI) in following the instructions of both the prompt and the mask. Due to the training bias from masking, inpainting quality is hindered when the prompt instruction and the image condition are unrelated. Therefore, we conduct a detailed analysis of the internal representations learned by SDI, focusing on how the mask input influences the cross-attention layer. We observe that adapting text key tokens toward the input mask enables the model to selectively paint within the given area. Leveraging these insights, we propose FreeCond, which adjusts only the input mask condition and image condition. By increasing the latent mask value and modifying the frequency content of the image condition, we align the cross-attention features with the model's training bias to improve generation quality without additional computation, particularly when user inputs are complicated and deviate from the training setup. Extensive experiments demonstrate that FreeCond can enhance any SDI-based model, e.g., yielding improvements of up to 60% and 58% in CLIP score for SDI and SDXLI, respectively.

Paper Structure

This paper contains 17 sections, 4 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Comparison of existing SOTA methods. BrushNet rigidly follows the mask instructions but only partially adheres to the prompt. PowerPaint produces outputs that are harmonious with the image context but at the cost of reduced prompt adherence. FreeCond addresses these limitations, as shown in the teaser figure.
  • Figure 2: Visualization of contextual influence: A random prompt, unrelated to the image condition, is assigned. The input mask is shown in columns 1 and 3, along with the corresponding prompt, while shifted areas are highlighted with a green frame. The resulting outputs are displayed in columns 2 and 4.
  • Figure 3: Illustration of mask size impact on inpainting metrics.
  • Figure 4: A self-attention visualization in different layers. The attention from $M$ is colored.
  • Figure 5: A cross-attention visualization of the example in Figure 4 across different cross-attention layers. The attention follows the input mask shape in the first layer and adapts to the output shape in the deeper layer.
  • ...and 5 more figures