Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt

Zhiqi Huang; Huixin Xiong; Haoyu Wang; Longguang Wang; Zhiheng Li

TL;DR

This work tackles the challenge of fine-grained control in text-to-image generation by addressing object fidelity and scene harmony through a mask-based conditioning strategy. It introduces Mask-ControlNet, which leverages SAM-derived foreground masks as an additional prompt and uses the reference object's image as a conditioning input, thereby decoupling foreground and background during diffusion-based synthesis. The training-time framework freezes the diffusion backbone while learning an adapter and a ControlNet, and the inference-time framework combines SAM masks, object image prompts, and CLIP-guided text to steer generation. Empirical results across diverse datasets show improved fidelity to reference objects, reduced artifacts, and superior or competitive performance across metrics such as CLIP-T, CLIP-I, DINO, FID, PSNR, and SSIM, with favorable qualitative assessments and user studies.
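As a point of reference for the pixel-level metrics listed above, PSNR is defined as 10 · log10(MAX² / MSE), where MSE is the mean squared error between two images and MAX is the largest possible pixel value. The sketch below is a minimal pure-Python illustration on toy grayscale images. The function name `psnr`, the nested-list image layout, and the 8-bit default for `max_value` are assumptions made for this example and do not reflect the authors' evaluation code.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two same-sized grayscale images.

    Images are nested lists of pixel intensities (an assumption of this sketch).
    """
    flat_ref = [value for row in reference for value in row]
    flat_gen = [value for row in generated for value in row]
    if not flat_ref or len(flat_ref) != len(flat_gen):
        raise ValueError("images must be non-empty and have the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_gen)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 images: two pixels differ by 10 intensity levels, so MSE = 50.
reference_image = [[100, 100], [100, 100]]
generated_image = [[110, 90], [100, 100]]
score = psnr(reference_image, generated_image)
```

Higher PSNR means the generated image is closer to the reference at the pixel level. It does not capture semantic similarity, which is why it is reported alongside feature-based metrics such as CLIP-I and DINO.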

Abstract

Text-to-image generation has witnessed great progress, especially with the recent advancements in diffusion models. Since texts cannot provide detailed conditions like object appearance, reference images are usually leveraged for the control of objects in the generated images. However, existing methods still suffer from limited accuracy when the relationship between the foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet by introducing an additional mask prompt. Specifically, we first employ large vision models to obtain masks that segment the objects of interest in the reference image. Then, the object images are employed as additional prompts to help the diffusion model better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts enhance the controllability of the diffusion model, maintaining higher fidelity to the reference image while achieving better image quality. Comparison with previous text-to-image generation methods demonstrates our method's superior quantitative and qualitative performance on the benchmark datasets.
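The step of turning a segmentation mask into an object image can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the binary mask is written by hand as a stand-in for the output of a segmentation model, the function name `extract_object` is hypothetical, and filling the background with black is an assumption made for the example.

```python
def extract_object(image, mask, fill=(0, 0, 0)):
    """Keep pixels where the mask is 1 and replace all others with `fill`.

    `image` is a nested list of RGB tuples and `mask` a nested list of 0/1
    values with the same height and width (both assumptions of this sketch).
    """
    if len(image) != len(mask) or any(
        len(img_row) != len(mask_row) for img_row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same height and width")

    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 2x3 RGB reference image.
reference = [
    [(255, 0, 0), (10, 20, 30), (0, 255, 0)],
    [(5, 5, 5), (200, 200, 200), (0, 0, 255)],
]

# Hand-written foreground mask; in practice this would come from a
# segmentation model rather than being typed in.
mask = [
    [0, 1, 0],
    [0, 1, 1],
]

object_image = extract_object(reference, mask)
```

The resulting `object_image` contains only the foreground pixels, which is the kind of background-free input the abstract describes as an additional prompt.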

Paper Structure (12 sections, 1 equation, 8 figures, 4 tables)

Introduction
Related Work
    Text-to-Image Generative Models
    Controllable Generative Models
Methodology
    Training-time Framework
    Inference-time Framework
Experiments
    Experimental Setup
    Performance Evaluation
    Model Analyses
Conclusion

Figures (8)

Figure 1: Limitations of existing image generation methods. The synthetic images suffer from object distortion (first row), background overfitting (second row), and foreground-background disharmony (last row).
Figure 2: An illustration of our framework during the training phase.
Figure 3: An illustration of our framework during the inference phase.
Figure 4: Synthetic images produced by different methods under the same prompt. In terms of object edges, texture, text, and color, our method generates images that preserve more detail of the reference object and look more realistic.
Figure 5: Comparison of backgrounds generated under the same prompt. Our method generates more diverse backgrounds.
...and 3 more figures
