MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model

Shan Yang

TL;DR

MFTF tackles precise object-level layout control in diffusion-based text-to-image generation without relying on masks, extra guidance images, or model fine-tuning. It does so through a parallel denoising scheme: cross-attention maps from a source diffusion model are used to generate masks dynamically, the masks are applied to the source model's self-attention queries to isolate objects, and the isolated queries are adjusted and injected into a target diffusion model. This enables translations, rotations, and other layout changes while preserving semantic intent. The approach supports single- and multi-object control, simultaneous layout manipulation and semantic editing, and text-guided segmentation via attention maps. MFTF thus offers mask-free, training-free layout control suited to image editing, segmentation, and downstream vision-language applications.

Abstract

Text-to-image generation models have revolutionized content creation, but diffusion-based vision-language models still struggle to precisely control the shape, appearance, and position of objects in generated images using text guidance alone. Existing global image editing models rely on additional masks or images as guidance to achieve layout control, and often require retraining of the model. Local object-editing models allow modifications to object shapes but cannot control object positions. To address these limitations, we propose the Mask-free Training-free Object-Level Layout Control Diffusion Model (MFTF), which provides precise control over object positions without requiring additional masks or images. MFTF supports both single-object and multi-object positional adjustments, such as translation and rotation, and enables simultaneous layout control and object semantic editing. MFTF runs a parallel denoising process for the source and target diffusion models. During this process, attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from its self-attention layers to isolate objects. These queries, generated in the source diffusion model, are then adjusted according to the layout control parameters and re-injected into the self-attention layers of the target diffusion model, which gives precise positional control of objects. The project source code is available at https://github.com/syang-genai/MFTF.
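
The mask, adjust, and re-inject steps described above can be illustrated with a small toy sketch. This is not the authors' implementation (see the project repository for that). It assumes that a mask is obtained by thresholding a min-max normalized cross-attention map, that the layout change is an integer translation, and that each spatial position carries a single scalar "query" instead of a per-head vector. The function names (`attention_mask`, `translate`, `inject_queries`) and the `threshold` parameter are hypothetical, and plain Python lists stand in for tensors so the sketch runs without dependencies.

```python
from typing import List

Grid = List[List[float]]  # H x W map; one scalar per spatial position (toy stand-in)


def attention_mask(cross_attn: Grid, threshold: float = 0.5) -> List[List[int]]:
    """Binarize a cross-attention map after min-max normalization.

    Thresholding is an assumption of this sketch, not a detail from the paper.
    """
    flat = [v for row in cross_attn for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[1 if (v - lo) / span > threshold else 0 for v in row]
            for row in cross_attn]


def translate(grid, dy: int, dx: int, fill=0.0):
    """Shift a grid by (dy, dx); cells shifted in from outside get `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = grid[y][x]
    return out


def inject_queries(source_q: Grid, target_q: Grid,
                   mask: List[List[int]], dy: int, dx: int) -> Grid:
    """Isolate masked source queries, move them, and paste them into the target."""
    masked_q = [[q * m for q, m in zip(q_row, m_row)]
                for q_row, m_row in zip(source_q, mask)]
    moved_q = translate(masked_q, dy, dx)
    moved_mask = translate(mask, dy, dx, fill=0)
    # Where the moved mask is set, use the moved source query; elsewhere keep the target's.
    return [[mq if mm else tq for mq, mm, tq in zip(mq_row, mm_row, tq_row)]
            for mq_row, mm_row, tq_row in zip(moved_q, moved_mask, target_q)]


# Toy usage: the "object" occupies the middle column of the top two rows.
cross_attn = [[0.1, 0.9, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.0]]
source_q = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0],
            [7.0, 8.0, 9.0]]
target_q = [[-1.0] * 3 for _ in range(3)]

mask = attention_mask(cross_attn)
shifted = inject_queries(source_q, target_q, mask, dy=0, dx=1)  # move one cell right
```

In this toy setup the object's queries end up one column to the right in the target, while every other position keeps the target's own queries. A rotation would replace `translate` with a different spatial remapping of the same masked queries.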
