Consistent Image Layout Editing with Diffusion Models

Tao Xia, Yudi Zhang, Ting Liu, Lei Zhang

TL;DR

This work tackles the challenge of editing real-image layouts using diffusion models by introducing a two-stage framework that first learns multiple object concepts from a single image (Multi-Concept Learning) and then enforces layout guidance with an appearance-projection mechanism grounded in diffusion-feature semantic consistency. It combines a region-based cross-attention loss for layout control with an Unconditional Appearance Projection and Region Prior Appearance Projection to transfer and refine object appearance in the edited regions, aided by a layout-friendly initialization noise. An asynchronous editing strategy further mitigates concept entanglement while maintaining fidelity. Extensive experiments on Layout-Bench show superior layout alignment and image quality compared with prior methods, demonstrating the practical viability of semantically consistent, diffusion-based layout editing for real images.
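
The summary above does not give the exact form of the region-based cross-attention loss. As a minimal sketch, assume one common formulation: penalize the fraction of a concept token's attention mass that falls outside its target region. The function name `region_attention_loss` and its arguments are hypothetical and are not taken from the paper.

```python
import numpy as np


def region_attention_loss(attn, mask, eps=1e-8):
    """Fraction of attention mass lying outside the target region.

    attn: (H, W) non-negative cross-attention map for one concept token.
    mask: (H, W) binary map, 1 inside the desired region, 0 elsewhere.
    Returns a value close to 0 when all attention is inside the region
    and close to 1 when none of it is.
    NOTE: assumed formulation for illustration, not the paper's loss.
    """
    attn = np.asarray(attn, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    inside = (attn * mask).sum()   # attention mass inside the region
    total = attn.sum() + eps       # total attention mass (eps avoids 0/0)
    return 1.0 - inside / total


if __name__ == "__main__":
    target = np.zeros((4, 4))
    target[:2, :2] = 1.0           # region covers the top-left quarter
    uniform = np.ones((4, 4))      # attention spread over the whole map
    print(region_attention_loss(uniform, target))
```

In a guidance setting, the gradient of such a loss with respect to the noisy latent would be used to steer sampling toward the target layout; that step needs an autodiff framework and is omitted here.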

Abstract

Despite the great success of large-scale text-to-image diffusion models in image generation and image editing, existing methods still struggle to edit the layout of real images. The few works that tackle this problem either fail to adjust the layout or have difficulty preserving the visual appearance of objects after the layout adjustment. To bridge this gap, this paper proposes a novel image layout editing method that not only re-arranges a real image to a specified layout but also keeps the visual appearance of the objects consistent with their appearance before editing. Concretely, the proposed method consists of two key components. First, a multi-concept learning scheme learns the concepts of different objects from a single image, which is crucial for maintaining visual consistency during layout editing. Second, it leverages the semantic consistency within intermediate features of diffusion models to project the appearance information of objects directly onto the desired regions. In addition, a novel initialization noise design is adopted to facilitate re-arranging the layout. Extensive experiments demonstrate that the proposed method outperforms previous works in both layout alignment and visual consistency for the task of image layout editing.
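
The abstract describes projecting object appearance onto the desired regions by exploiting semantic consistency in intermediate diffusion features. One plausible reading, sketched here under stated assumptions, is a nearest-neighbor match: each target-region location copies the appearance of the source-region location whose feature is most similar. The cosine-similarity matching, the function name `project_appearance`, and all argument names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np


def project_appearance(src_feat, tgt_feat, src_values, src_mask, tgt_mask):
    """Copy appearance from a source region to a target region by feature matching.

    src_feat, tgt_feat: (H, W, C) feature maps of the original / edited image.
    src_values:         (H, W, D) appearance to transfer (e.g. latents).
    src_mask, tgt_mask: (H, W) boolean masks of the object before / after editing.
    Returns (out, matches): out is (H, W, D) with the target region filled in and
    zeros elsewhere; matches[k] is the (row, col) source location chosen for the
    k-th target pixel in row-major order.
    NOTE: hypothetical sketch, assuming cosine-similarity nearest neighbors.
    """
    src_idx = np.argwhere(src_mask)           # (Ns, 2), row-major order
    s = src_feat[src_mask].astype(np.float64)  # (Ns, C), same order as src_idx
    t = tgt_feat[tgt_mask].astype(np.float64)  # (Nt, C)
    # L2-normalize so the dot product equals cosine similarity.
    s /= np.linalg.norm(s, axis=1, keepdims=True) + 1e-8
    t /= np.linalg.norm(t, axis=1, keepdims=True) + 1e-8
    nearest = (t @ s.T).argmax(axis=1)         # best source index per target pixel
    out = np.zeros_like(src_values)
    out[tgt_mask] = src_values[src_mask][nearest]
    return out, src_idx[nearest]
```

Matching every target pixel against every source pixel costs O(Nt × Ns) similarity evaluations, which is manageable at the low spatial resolutions typical of intermediate diffusion features.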

Paper Structure

This paper contains 14 sections, 9 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Examples of layout editing for a single real image. Given a single real image, our method can transform its layout and preserve a consistent visual appearance, in contrast to self-guidance diffusion (SGD; Epstein et al., 2023).
  • Figure 2: Semantic consistency in diffusion feature space. The first column shows the original image and the layout editing result of our method; the following columns show the principal component analysis (PCA; Pearson, 1901) of intermediate diffusion features. Regions with similar semantics share similar colors. This shows that semantic consistency in RGB space extends to the diffusion feature space across the whole denoising process.
  • Figure 3: Method overview.
  • Figure 4: Failure case and analysis. The attention region edited by CLED (Zhang et al., 2023) is spread across the entire image rather than focused on the object. The right column shows the editing results obtained when the fine-tuning stage is removed.
  • Figure 5: Mismatched case. Left: Mismatched points by UAP. Right: Corrected matching points by RPAP.
  • ...and 10 more figures
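
The PCA visualization described in the Figure 2 caption can be approximated with a few lines of NumPy. This is a minimal sketch, assuming the features are available as an (H, W, C) array with C ≥ 3; the function name `pca_rgb` is hypothetical, and how the features are extracted from the diffusion model is outside its scope.

```python
import numpy as np


def pca_rgb(features, n_components=3):
    """Project an (H, W, C) feature map onto its top principal components.

    Each component is rescaled to [0, 1] so the result can be shown as an
    RGB image in which similar features receive similar colors.
    """
    h, w, c = features.shape
    x = features.reshape(-1, c).astype(np.float64)
    x -= x.mean(axis=0, keepdims=True)          # center each channel
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:n_components].T               # (H*W, n_components)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)        # per-component min-max scaling
    return proj.reshape(h, w, n_components)
```

To compare two images with a shared color coding, one option is to concatenate their feature maps along a spatial axis before calling the function, so that both are projected onto the same principal directions.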