Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image Guidance

Jiacheng Wang, Ping Liu, Wei Xu

TL;DR

The paper tackles the challenge of editing images along both rigid and non-rigid dimensions under text or reference-image guidance. It introduces a dual-path injection architecture to separate appearance and structural information and a unified self-attention mechanism to fuse these signals across generation processes. Latent fusion techniques, including AdaIN-based alignment and blend-diffusion background fusion, mitigate color shifts and unintended background changes without model fine-tuning. Empirical results on text-based editing and appearance transfer show competitive or superior performance across rigid and non-rigid tasks, demonstrating versatile, guidance-powered editing.
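As a rough illustration of the two latent fusion ideas named above, the sketch below applies AdaIN-style statistic matching to a latent and then blends it with the source latent under a mask. It is not the authors' implementation: the function names `adain` and `blend_background`, the `(C, H, W)` latent layout, and the hand-made mask are all assumptions made for this example.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Shift and scale `content` so its per-channel mean/std match `style`.

    Both arrays are assumed to have shape (C, H, W).
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return (content - c_mean) / (c_std + eps) * s_std + s_mean

def blend_background(edited, source, fg_mask):
    """Keep the edited latent where fg_mask is 1 and the source latent elsewhere."""
    return fg_mask * edited + (1.0 - fg_mask) * source

# Toy latents: the "edited" latent has drifted statistics (a color shift).
rng = np.random.default_rng(0)
edited = rng.normal(2.0, 3.0, size=(4, 8, 8))
source = rng.normal(0.0, 1.0, size=(4, 8, 8))

aligned = adain(edited, source)    # statistics pulled back toward the source
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0            # hypothetical foreground region
fused = blend_background(aligned, source, mask)
```

In an actual diffusion pipeline these operations would presumably run on intermediate latents at selected denoising steps; which steps and how the mask is obtained are not stated in this summary.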

Abstract

Existing text-to-image editing methods tend to excel either in rigid or non-rigid editing but encounter challenges when combining both, resulting in misaligned outputs with the provided text prompts. In addition, integrating reference images for control remains challenging. To address these issues, we present a versatile image editing framework capable of executing both rigid and non-rigid edits, guided by either textual prompts or reference images. We leverage a dual-path injection scheme to handle diverse editing scenarios and introduce an integrated self-attention mechanism for fusion of appearance and structural information. To mitigate potential visual artifacts, we further employ latent fusion techniques to adjust intermediate latents. Compared to previous work, our approach represents a significant advance in achieving precise and versatile image editing. Comprehensive experiments validate the efficacy of our method, showcasing competitive or superior results in text-based editing and appearance transfer tasks, encompassing both rigid and non-rigid settings.
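One way to picture a self-attention layer that fuses appearance and structural information is to let the queries of the editing process attend over keys and values gathered from both auxiliary generation processes. The sketch below does this by simple concatenation. The abstract does not give the paper's exact formulation, so this is an assumption for illustration only, and `fused_self_attention` and its argument names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fused_self_attention(q_edit, k_app, v_app, k_struct, v_struct):
    """Attend from editing-path queries to appearance and structure tokens.

    q_edit: (N, d); k_app, v_app, k_struct, v_struct: (M, d).
    Returns the attended features (N, d) and the attention weights (N, 2M).
    """
    k = np.concatenate([k_app, k_struct], axis=0)   # (2M, d)
    v = np.concatenate([v_app, v_struct], axis=0)   # (2M, d)
    scores = q_edit @ k.T / np.sqrt(q_edit.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16))
k_a, v_a = rng.normal(size=(7, 16)), rng.normal(size=(7, 16))
k_s, v_s = rng.normal(size=(7, 16)), rng.normal(size=(7, 16))
out, w = fused_self_attention(q, k_a, v_a, k_s, v_s)
print(out.shape, w.shape)  # (5, 16) (5, 14)
```

Because the softmax runs over the concatenated tokens, each query decides per position how much to draw from the appearance branch versus the structure branch.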

Paper Structure

This paper contains 6 sections, 7 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Illustration of limitations of prior works on rigid and non-rigid editing tasks.
  • Figure 2: An Overview of Our Approach. Given the input pairs ($I^{app}$, $T^{struct}$), ($I^{struct}$, $T^{app}$), and ($I^{app}$, $I^{struct}$), our method aims to produce the desired editing results $I^{out}_1$, $I^{out}_2$, and $I^{out}_3$, which correspond to text-based non-rigid edits, text-based rigid edits, and image-based edits, respectively. During denoising, the source image and the target guidance each drive a distinct generation process, from which information is subsequently injected into the editing process to achieve the desired manipulation.
  • Figure 3: Qualitative Comparisons with Text-Based Editing Methods: P2P, PnP, and MasaCtrl. Our approach successfully accomplishes both rigid and non-rigid editing, demonstrating improved alignment with the target prompt.
  • Figure 4: Qualitative Comparisons to Appearance Transfer Methods. Our method can effectively integrate appearance and structural information from different images.
  • Figure 5: Ablation Study. Each column represents the effect of removing a specific component.
  • ...and 8 more figures