LocRef-Diffusion: Tuning-Free Layout and Appearance-Guided Generation

Fan Deng, Yaguang Wu, Xinyang Yu, Xiangjun Huang, Jian Yang, Guangyu Yan, Qiang Xu

TL;DR

LocRef-Diffusion is a novel, tuning-free model for personalized customization of the appearance and position of multiple instances within an image. Its Layout-net controls where instances are generated by combining explicit instance layout information with an instance region cross-attention module.

Abstract

Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, personalized, controllable generation of instances within these images still needs further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve appearance fidelity to reference images, we employ an Appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout- and appearance-guided generation.
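
The abstract does not spell out how the instance region cross-attention is computed, so the following is only a minimal sketch of one plausible reading: each spatial position of the latent grid attends solely to the appearance tokens of instances whose bounding box covers it. Everything here is an illustrative assumption rather than the authors' implementation: the function names (`box_to_mask`, `region_cross_attention`), the normalized `(x0, y0, x1, y1)` box format, the use of raw tokens as both keys and values with no learned projections, and the choice to output zeros outside every box.

```python
import math


def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box onto a height x width grid.

    Returns a flat, row-major list of booleans; a cell is inside the box
    when its center falls within the box (assumed convention).
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        cy = (row + 0.5) / height
        for col in range(width):
            cx = (col + 0.5) / width
            mask.append(x0 <= cx < x1 and y0 <= cy < y1)
    return mask


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_cross_attention(queries, instance_tokens, boxes, height, width):
    """Cross-attention restricted to instance regions (hypothetical sketch).

    queries: height * width query vectors, row-major.
    instance_tokens: for each instance, a list of feature vectors that
        serve as both keys and values (a simplifying assumption).
    boxes: one normalized box per instance.
    """
    dim = len(queries[0])
    masks = [box_to_mask(box, height, width) for box in boxes]
    outputs = []
    for pos, query in enumerate(queries):
        # Gather tokens only from instances whose box covers this position.
        visible = [
            token
            for mask, tokens in zip(masks, instance_tokens)
            if mask[pos]
            for token in tokens
        ]
        if not visible:
            # Outside every box: no instance contribution (assumed behavior).
            outputs.append([0.0] * dim)
            continue
        scores = [
            sum(q * k for q, k in zip(query, token)) / math.sqrt(dim)
            for token in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * token[d] for w, token in zip(weights, visible)) for d in range(dim)]
        )
    return outputs


# Toy usage: a 2 x 2 latent grid and one instance occupying the left half.
height, width = 2, 2
queries = [[1.0, 0.0] for _ in range(height * width)]
boxes = [(0.0, 0.0, 0.5, 1.0)]
instance_tokens = [[[2.0, 3.0]]]
out = region_cross_attention(queries, instance_tokens, boxes, height, width)
```

In this toy case the two left-column positions receive the instance's feature vector and the two right-column positions receive zeros. A practical version would more likely apply learned query, key, and value projections and add the result to the diffusion model's hidden states, but those details are not given in the text above.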

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The overall architecture of our proposed LocRef-Diffusion with Layout-net and Appearance-net. Only the newly added modules (shown in red) are trained, while the pretrained text-to-image model is frozen.
  • Figure 2: Qualitative results for multi-instance generation.