Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion

Hongyu Chen; Yiqi Gao; Min Zhou; Peng Wang; Xubin Li; Tiezheng Ge; Bo Zheng

TL;DR

The paper tackles prompt following under visual control in diffusion-based text-to-image systems, where text prompts and visual controls can be misaligned. It introduces a training-free method, Mask-guided Prompt Following (MGPF), consisting of Masked ControlNet and an Attribute-Matching Loss. Object masks separate the aligned regions of the visual control from the misaligned ones, so that objects named in the prompt can still be generated in misaligned regions and attributes bind to the correct objects. By combining object masks with cross-attention-based losses, MGPF is reported to outperform baseline methods across multiple visual controls and metrics while preserving image aesthetics. The approach also generalizes to other diffusion models such as ChilloutMix, offering a practical solution for reliable prompt following in visually controlled generation.

Abstract

Recently, integrating visual controls into text-to-image (T2I) models, as in the ControlNet method, has received significant attention for its finer control capabilities. While various training-free methods attempt to enhance prompt following in T2I models, prompt following under visual control remains rarely studied, especially when the visual controls are misaligned with the text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of visual controls and prompts. A network, dubbed Masked ControlNet, is designed to use these object masks for object generation in the misaligned visual control region. Further, to improve attribute matching, a simple yet efficient loss is designed to align the attention maps of attributes with object regions constrained by ControlNet and object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
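To make the idea of aligning an attribute's attention map with an object region concrete, here is a toy sketch in plain Python. The exact loss used by MGPF is not given on this page, so the formula below (one minus the share of attention mass that falls inside the object mask) and the name `attention_mask_loss` are illustrative assumptions, not the authors' definition.

```python
from typing import List

Grid = List[List[float]]  # H x W map stored as nested lists


def attention_mask_loss(attn: Grid, mask: Grid, eps: float = 1e-8) -> float:
    """Hypothetical loss: 1 - (attention inside the mask / total attention).

    attn holds non-negative cross-attention weights for one attribute token;
    mask is a binary object mask of the same size. This is an assumed
    stand-in, not the formulation from the paper.
    """
    inside = sum(
        a * m
        for row_a, row_m in zip(attn, mask)
        for a, m in zip(row_a, row_m)
    )
    total = sum(a for row in attn for a in row)
    return 1.0 - inside / (total + eps)


# Attention for "yellow" split evenly between the dog region and background.
attn_yellow = [[0.5, 0.5], [0.0, 0.0]]
dog_mask = [[1, 0], [0, 0]]
loss_value = attention_mask_loss(attn_yellow, dog_mask)
```

Under this assumed definition the loss falls toward 0 as more of the attribute's attention lands inside the object region, which is the direction a training-free method would push the latents during sampling.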

Paper Structure (11 sections, 8 equations, 6 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Method
Masked ControlNet
Attribute-Matching Loss
Experiments
Evaluation Setup
Results
Robustness to Other Models: ChilloutMix
Human Evaluations
Conclusions

Figures (6)

Figure 1: The input prompt and visual control are only partially aligned, as shown on the left. The ControlNet result omits the yellow flowers and the grassy park, and paints the cup blue rather than white. Our approach, MGPF, uses object masks to produce images with these desired features.
Figure 2: Under visual control, the prompt is either fed or not fed to the U-net and to ControlNet. If the U-net does not receive the prompt, the image lacks the yellow and purple mentioned in the prompt, regardless of whether ControlNet receives it. If the U-net does receive the prompt, the opposite holds. This indicates that attribute words act mainly through the cross-attention between the U-net and the prompt features.
Figure 3: Overview of the proposed Mask-guided Prompt Following (MGPF) method. Given a prompt and a canny edge condition with misaligned elements such as door edges, along with two object masks indicating "dog" and "skateboard", our approach uses two modules to enhance prompt following. In Masked ControlNet, we take the union of all object masks to form a single composite mask, which is reshaped and element-wise multiplied with the corresponding ControlNet features, effectively eliminating the influence of undesired visual cues. For the Attribute-Matching Loss, we parse the prompt into attribute-object pairs such as "yellow dog" and "green skateboard" and obtain their cross-attention maps from the U-net and ControlNet. Loss functions then shift these attention maps in the U-net and ControlNet for better attribute binding.
Figure 4: Qualitative comparison using prompts from our dataset. We show images generated by all our baseline methods. We use the same seed across all approaches.
Figure 5: Qualitative ablation results. MC, ML, and LL denote Masked ControlNet, Mask Loss, and Language-Guided Loss, respectively.
...and 1 more figure
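The masking step described in the Figure 3 caption (union the object masks, reshape, multiply element-wise with the ControlNet features) can be sketched in plain Python as follows. The function names, the nested-list layout, and the nearest-neighbor resize are assumptions made for illustration; a real implementation would operate on framework tensors and may resize differently.

```python
from typing import List

Mask = List[List[int]]               # H x W binary mask
Feature = List[List[List[float]]]    # C x H x W feature map


def union_masks(masks: List[Mask]) -> Mask:
    """Pixel-wise union (logical OR) of same-sized binary masks."""
    h, w = len(masks[0]), len(masks[0][0])
    return [[int(any(m[y][x] for m in masks)) for x in range(w)]
            for y in range(h)]


def resize_nearest(mask: Mask, out_h: int, out_w: int) -> Mask:
    """Nearest-neighbor resize to the feature resolution (assumed choice)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def mask_features(features: Feature, masks: List[Mask]) -> Feature:
    """Zero out feature positions that fall outside every object mask."""
    out_h, out_w = len(features[0]), len(features[0][0])
    m = resize_nearest(union_masks(masks), out_h, out_w)
    return [[[ch[y][x] * m[y][x] for x in range(out_w)] for y in range(out_h)]
            for ch in features]


# Two 4x4 object masks ("dog" top-left, "skateboard" bottom-right).
dog = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
board = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
feats = [[[1.0, 1.0], [1.0, 1.0]]]   # one channel at 2x2 resolution
masked = mask_features(feats, [dog, board])
```

In this toy case the features survive only where the dog or skateboard mask is set, which mirrors how the composite mask is meant to suppress control signals, such as the door edges, outside the object regions.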
