Structure-Preserving Zero-Shot Image Editing via Stage-Wise Latent Injection in Diffusion Models

Dasol Jeong, Donggoo Kang, Jiwon Park, Hyebean Lee, Joonki Paik

TL;DR

The paper tackles zero-shot image editing by unifying text-guided and reference-guided approaches without fine-tuning. It introduces a diffusion-based framework that uses DDIM inversion and optimized null-text embeddings to preserve source structure, coupled with stage-wise latent injection: shape injection in early timesteps retains geometry, and attribute injection in later timesteps transfers fine-grained attributes via cross-attention with reference latents. The reported results show strong semantic alignment and structural fidelity across expression transfer, texture transformation, and style infusion, surpassing baselines in both qualitative comparisons and quantitative metrics. The approach is presented as scalable and free of task-specific fine-tuning, with multi-modal conditioning named as a possible future extension.

Abstract

We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. By introducing a stage-wise latent injection strategy (shape injection in early steps and attribute injection in later steps), we enable precise, fine-grained modifications while maintaining global consistency. Cross-attention with reference latents facilitates semantic alignment between the source and reference. Extensive experiments across expression transfer, texture transformation, and style infusion demonstrate state-of-the-art performance, confirming the method's scalability and adaptability to diverse image editing scenarios.
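
The stage-wise schedule described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `switch_step` and `blend` parameters, the choice to overwrite the latent with the source latent during the shape stage, and the choice to take queries from the edited latent and keys/values from the reference latent are all assumptions made for illustration. A real system would run these operations inside a denoising U-Net, which is replaced here by a placeholder step.

```python
import numpy as np

def cross_attention(query, reference):
    """Scaled dot-product attention; keys and values both come from `reference`."""
    d = query.shape[-1]
    scores = query @ reference.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ reference

def stage_wise_edit(z_start, source_latents, ref_latents, switch_step,
                    denoise_step, blend=0.5):
    """Toy editing loop; step i counts denoising steps in order (0 = first).

    switch_step and blend are hypothetical knobs, not values from the paper.
    """
    z = z_start
    stages = []
    for i in range(len(source_latents)):
        if i < switch_step:
            # Assumed form of shape injection: reuse the inverted source latent.
            z = source_latents[i].copy()
            stages.append("shape")
        else:
            # Assumed form of attribute injection: attend to the reference latent.
            z = (1.0 - blend) * z + blend * cross_attention(z, ref_latents[i])
            stages.append("attribute")
        z = denoise_step(z, i)
    return z, stages

rng = np.random.default_rng(0)
num_steps, tokens, dim = 6, 4, 8
source_latents = [rng.normal(size=(tokens, dim)) for _ in range(num_steps)]
ref_latents = [rng.normal(size=(tokens, dim)) for _ in range(num_steps)]
identity_step = lambda z, i: z  # placeholder for a real U-Net denoising step

edited, stages = stage_wise_edit(source_latents[0], source_latents, ref_latents,
                                 switch_step=3, denoise_step=identity_step)
print(stages)
```

The single switch point keeps the two stages disjoint, which mirrors the early/late split in the abstract; where that switch should fall is not stated in this summary and would need to be taken from the paper itself.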

Paper Structure

This paper contains 23 sections, 5 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Results of reference-guided image editing using the proposed method.
  • Figure 2: Inversion process for source and reference images. (a) Source Image Null-Text Inversion: The source image undergoes Null-Text Inversion (NTI) to optimize null-text embeddings, ensuring that its structural integrity is preserved throughout the editing process. This allows for precise text- or reference-guided modifications without fine-tuning. (b) Reference Image DDIM Inversion: The reference image is inverted using DDIM inversion, generating a latent representation that captures fine-grained attributes such as texture, style, and expression. These reference latents are later integrated into the denoising process to guide attribute injection.
  • Figure 3: An overview of the proposed zero-shot image editing framework. The process begins with DDIM inversion of the source image to extract its latent representation $z_t^{s*}$ and optimize null embeddings $\varnothing$ for structural preservation. Text embeddings $\mathcal{P}$ and reference image latents $z_t^{r*}$ guide the denoising U-Net during the editing process. The framework incorporates shape injection during early timesteps to maintain structural fidelity and attribute injection in later timesteps to transfer fine-grained semantic and stylistic attributes from the reference image. Self-attention and cross-attention mechanisms within the U-Net enable precise control over structural and semantic transformations, ensuring high-quality editing results.
  • Figure 4: The outputs show that when the source image is fixed, the proposed method accurately integrates stylistic attributes from various reference images while preserving the source’s structure.
  • Figure 5: Qualitative comparison of generated results across different methods on the AFHQ dataset. The first and second columns contain the source and reference images, respectively. The subsequent columns display outputs from (left to right) Ours, InjectFusion, and DiffuseIT.
  • ...and 3 more figures
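
The DDIM inversion mentioned in the Figure 2 and Figure 3 captions can be sketched with the standard deterministic DDIM update, run from clean to noisy. The noise schedule, the tensor shapes, and the `toy_eps_model` below are assumptions for illustration only; in practice the noise prediction comes from a trained U-Net.

```python
import numpy as np

def ddim_step(z, eps, alpha_from, alpha_to):
    """Deterministic DDIM move between two cumulative-alpha noise levels."""
    x0_pred = (z - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(z0, alphas_cumprod, eps_model):
    """Walk from the clean latent to the noisy one, keeping every latent."""
    latents = [z0]
    for t in range(len(alphas_cumprod) - 1):
        z = latents[-1]
        latents.append(ddim_step(z, eps_model(z, t),
                                 alphas_cumprod[t], alphas_cumprod[t + 1]))
    return latents

def ddim_sample(z_noisy, alphas_cumprod, eps_model):
    """Walk back from the noisy latent to the clean one."""
    z = z_noisy
    for t in range(len(alphas_cumprod) - 1, 0, -1):
        z = ddim_step(z, eps_model(z, t),
                      alphas_cumprod[t], alphas_cumprod[t - 1])
    return z

rng = np.random.default_rng(0)
alphas_cumprod = np.linspace(0.999, 0.05, 10)  # assumed toy schedule, clean to noisy
fixed_eps = rng.normal(size=(4, 8))
toy_eps_model = lambda z, t: fixed_eps  # stand-in for a trained noise predictor

z0 = rng.normal(size=(4, 8))
trajectory = ddim_invert(z0, alphas_cumprod, toy_eps_model)
reconstruction = ddim_sample(trajectory[-1], alphas_cumprod, toy_eps_model)
print(np.allclose(reconstruction, z0))
```

The round trip is exact here only because the toy predictor ignores its input. With a real network the prediction depends on the latent, so inversion is approximate, and optimizing null-text embeddings per timestep is one way to reduce the resulting reconstruction drift.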