MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model

Shan Yang

TL;DR

MFTF tackles precise object-level layout control in diffusion-based text-to-image generation without relying on masks, guidance images, or model fine-tuning. It does so through a parallel denoising scheme: cross-attention maps from a source diffusion model are used to generate masks dynamically, the masks are applied to the source model's self-attention queries to isolate objects, and the masked queries are adjusted according to the layout control parameters and injected into the self-attention layers of a target diffusion model. This enables translations, rotations, and other layout changes while preserving semantic intent. The approach supports single- and multi-object control as well as simultaneous layout manipulation and semantic editing, and it also enables text-guided segmentation via attention maps. Overall, MFTF offers flexible, mask-free, training-free image generation with potential applications in editing, segmentation, and downstream vision-language tasks.

Abstract

Text-to-image generation models have revolutionized content creation, but diffusion-based vision-language models still struggle to precisely control the shape, appearance, and position of objects in generated images using text guidance alone. Existing global image editing models rely on additional masks or images as guidance to achieve layout control, and often require retraining the model. Local object-editing models allow modifications to object shapes but cannot control object positions. To address these limitations, we propose the Mask-free Training-free Object-Level Layout Control Diffusion Model (MFTF), which provides precise control over object positions without requiring additional masks or images. MFTF supports both single-object and multi-object positional adjustments, such as translation and rotation, and enables simultaneous layout control and object semantic editing. MFTF runs a parallel denoising process for the source and target diffusion models. During this process, attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from its self-attention layers to isolate objects. These queries, generated in the source diffusion model, are then adjusted according to the layout control parameters and re-injected into the self-attention layers of the target diffusion model, which gives precise positional control over objects. Project source code is available at https://github.com/syang-genai/MFTF.
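
The query manipulation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`translate`, `inject_moved_queries`), the `(H, W, C)` query layout, and the rule of keeping the target model's own queries outside the moved object region are all assumptions made for illustration, and only translation is shown.

```python
import numpy as np

def translate(x, dy, dx):
    """Shift an (H, W, ...) array by (dy, dx); vacated cells are filled with zeros."""
    out = np.zeros_like(x)
    h, w = x.shape[:2]
    if abs(dy) >= h or abs(dx) >= w:
        return out  # the object is shifted entirely out of the grid
    ys, ye = max(dy, 0), min(h + dy, h)
    xs, xe = max(dx, 0), min(w + dx, w)
    out[ys:ye, xs:xe] = x[ys - dy:ye - dy, xs - dx:xe - dx]
    return out

def inject_moved_queries(q_source, q_target, mask, dy, dx):
    """Hypothetical helper: move the masked source queries and merge them into the target.

    q_source, q_target: (H, W, C) self-attention queries from the two models.
    mask: (H, W) boolean object mask derived from the source cross-attention.
    """
    obj = q_source * mask[..., None]        # isolate the object's queries
    moved_obj = translate(obj, dy, dx)      # apply the layout control (translation only)
    moved_mask = translate(mask, dy, dx)    # where the object lands in the target
    # Assumption: outside the moved object, the target keeps its own queries.
    return np.where(moved_mask[..., None], moved_obj, q_target)

# Usage on a toy 4x4 grid with a single-channel query.
q_s = np.arange(1, 17, dtype=float).reshape(4, 4, 1)
q_t = -np.ones((4, 4, 1))
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True
q_ms = inject_moved_queries(q_s, q_t, mask, dy=1, dx=2)
```

In a real diffusion model this step would run inside the self-attention layers at each denoising step; the sketch only shows the array operations for one layer and one step.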

Paper Structure

This paper contains 26 sections, 5 equations, 13 figures, 1 table, 1 algorithm.

Figures (13)

  • Figure 1: MFTF achieves single-object and multi-object layout control, as well as simultaneous layout control and semantic editing, without a guidance image or mask and without any model training or fine-tuning.
  • Figure 2: MFTF architecture. MFTF dynamically generates attention masks from the cross-attention layers of the source diffusion model. These attention masks are then applied to the queries $Q_s$ derived from the self-attention layers to separate the objects from the background. Subsequently, the modified queries $Q_{ms}$ are generated in accordance with the layout control parameters $L$. Finally, $Q_{ms}$ is injected into the self-attention layer of the target diffusion model, enabling precise positional control during the denoising process.
  • Figure 3: The cross-attention mask $M^i_s$ is generated from the cross-attention layers $A^i_s$.
  • Figure 4: Visualization of $Q$ at self-attention layers $l = 11, 13, 15$ under three conditions: without the cross-attention mask, with the cross-attention mask applied, and with additional positional control $L$.
  • Figure 5: Quantitative comparison of images: object-level layout control via text descriptions vs. the MFTF model; without vs. with the MFTF model.
  • ...and 8 more figures
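
The mask generation named in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the function name `cross_attention_mask`, the `(heads, H*W, tokens)` layout of the attention tensor, averaging over heads, min-max normalization, and the `threshold` value are all hypothetical choices.

```python
import numpy as np

def cross_attention_mask(attn_maps, token_index, threshold=0.5):
    """Hypothetical helper: turn one token's cross-attention into a binary mask.

    attn_maps: (heads, H*W, tokens) attention probabilities from one layer.
    Returns an (H, W) boolean mask for the chosen prompt token.
    """
    _, n_pixels, _ = attn_maps.shape
    side = int(round(np.sqrt(n_pixels)))
    if side * side != n_pixels:
        raise ValueError("expected a square latent grid")
    # Average the token's attention over heads and reshape to the spatial grid.
    token_map = attn_maps[:, :, token_index].mean(axis=0).reshape(side, side)
    # Min-max normalize so the threshold is relative to this map's range.
    lo, hi = token_map.min(), token_map.max()
    normalized = (token_map - lo) / (hi - lo + 1e-8)
    return normalized >= threshold

# Usage: two heads, a 4x4 latent grid, three prompt tokens.
attn = np.zeros((2, 16, 3))
attn[:, 5, 1] = 1.0  # token 1 attends only to pixel (1, 1)
mask = cross_attention_mask(attn, token_index=1)
```

Because the mask is recomputed from the attention maps, it can follow the object as it forms during denoising, which matches the abstract's description of masks being generated dynamically.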