Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

Marianna Ohanyan; Hayk Manukyan; Zhangyang Wang; Shant Navasardyan; Humphrey Shi

TL;DR

Zero-Painter tackles layout-conditioned text-to-image synthesis without training by conditioning on per-object masks $M_i$, per-object prompts $\tau_i$, and a global prompt $\tau_{global}$. It introduces a two-stage pipeline: Stage 1, Single Object Generation, uses Prompt-Adjusted Cross-Attention (PACA) to enforce object shape and attribute fidelity; Stage 2, Comprehensive Composition, uses Region-Grouped Cross-Attention (ReGCA) to fuse the objects coherently. The main contributions are the PACA and ReGCA blocks, the training-free two-stage framework, and an extensive evaluation showing improved shape fidelity and textual alignment over state-of-the-art methods. The approach enables precise, mask-aware text-to-image synthesis for complex layouts without fine-tuning, building on Stable Diffusion and SAM-based segmentation. $M_i$, $\tau_i$, and $\tau_{global}$ are the central conditioning signals in the pipeline.

Abstract

We present Zero-Painter, a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity. Zero-Painter employs a two-stage process involving our novel Prompt-Adjusted Cross-Attention (PACA) and Region-Grouped Cross-Attention (ReGCA) blocks, ensuring precise alignment of generated objects with textual prompts and mask shapes. Our extensive experiments demonstrate that Zero-Painter surpasses current state-of-the-art methods in preserving textual details and adhering to mask shapes.
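
The conditioning inputs described above (object masks, individual descriptions, and a global prompt) can be pictured as a small data structure. The following is only an illustrative sketch, not the authors' code: the names `ObjectSpec`, `LayoutCondition`, `box_mask`, and `background_mask` are hypothetical, masks are assumed to be plain boolean grids, and the example prompts are made up.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a mask is an H x W grid of booleans, True inside the object region.
Mask = List[List[bool]]


@dataclass
class ObjectSpec:
    """One object of the layout: its mask M_i and its prompt tau_i (hypothetical type)."""
    mask: Mask
    prompt: str


@dataclass
class LayoutCondition:
    """All conditioning signals: per-object specs plus the global prompt tau_global."""
    objects: List[ObjectSpec]
    global_prompt: str

    def background_mask(self) -> Mask:
        # The background is every pixel not covered by any object mask.
        height = len(self.objects[0].mask)
        width = len(self.objects[0].mask[0])
        return [
            [not any(obj.mask[y][x] for obj in self.objects) for x in range(width)]
            for y in range(height)
        ]


def box_mask(height: int, width: int, top: int, left: int, bottom: int, right: int) -> Mask:
    """Helper for the example: a rectangular mask covering rows [top, bottom) and columns [left, right)."""
    return [
        [top <= y < bottom and left <= x < right for x in range(width)]
        for y in range(height)
    ]


# Two non-overlapping 3x3 objects on an 8x8 canvas (illustrative values only).
layout = LayoutCondition(
    objects=[
        ObjectSpec(box_mask(8, 8, 1, 1, 4, 4), "a red apple"),
        ObjectSpec(box_mask(8, 8, 4, 4, 7, 7), "a green pear"),
    ],
    global_prompt="a photo of fruit on a wooden table",
)
background = layout.background_mask()
```

In this sketch, each `ObjectSpec` would feed the single-object stage, while the full `LayoutCondition`, including the global prompt and the derived background region, would feed the composition stage.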

Paper Structure (27 sections, 16 equations, 19 figures, 2 tables)

Introduction
Related Work
Text-to-Image Generation
Layout-to-Image Generation
Text-Guided Image Inpainting
Method
Stable Diffusion
Zero-Painter
Single Object Generation (SOG)
Prompt-Aware Cross-Attention (PACA)
Comprehensive Composition (CC)
Object Segmentation
Inpainting
Region-Grouped Cross-Attention (ReGCA)
Experiments
...and 12 more sections

Figures (19)

Figure 1: Embark on a visual journey with Zero-Painter: a novel training-free framework for layout-conditional text-to-image generation. This new pipeline brings images to life using object masks and individual descriptions, seamlessly fused with a powerful global text prompt.
Figure 2: Optimization-Free Two-Stage Pipeline for Zero-Shot Image Composition: (a) In the first stage, we focus on single object generation, leveraging the innovative Prompt-Adjusted Cross-Attention (PACA) layer. (b) Moving to the comprehensive composition stage, we introduce the Region-Grouped Cross-Attention (ReGCA) block, facilitating seamless and dynamic composition of generated objects.
Figure 3: Effect of the SOT token. The similarity with the SOT token has been increased during text-to-image generation (at every step) in the non-white areas of the masks (left side). Prompt: "photo of a red apple, centered".
Figure 4: Similarity of the SOT token vs all other tokens combined.
Figure 5: Overview of Prompt-Aware Cross-Attention (PACA) during the Individual Object Generation stage.
...and 14 more figures
