Structure-Preserving Zero-Shot Image Editing via Stage-Wise Latent Injection in Diffusion Models
Dasol Jeong, Donggoo Kang, Jiwon Park, Hyebean Lee, Joonki Paik
TL;DR
The paper tackles zero-shot image editing by unifying text-guided and reference-guided approaches without fine-tuning. It introduces a diffusion-based framework that uses DDIM inversion and optimized null-text embeddings to preserve the source structure, coupled with stage-wise latent injection: shape injection in early timesteps retains geometry, and attribute injection in later timesteps uses cross-attention with reference latents for fine-grained transfer. The combination achieves strong semantic alignment and structural fidelity across expression transfer, texture transformation, and style infusion, surpassing baselines in both qualitative and quantitative evaluations. The approach is scalable and avoids task-specific fine-tuning, and multi-modal conditioning is noted as a future direction.
Abstract
We propose a diffusion-based framework for zero-shot image editing that unifies text-guided and reference-guided approaches without requiring fine-tuning. Our method leverages diffusion inversion and timestep-specific null-text embeddings to preserve the structural integrity of the source image. By introducing a stage-wise latent injection strategy (shape injection in early steps and attribute injection in later steps), we enable precise, fine-grained modifications while maintaining global consistency. Cross-attention with reference latents facilitates semantic alignment between the source and reference. Extensive experiments across expression transfer, texture transformation, and style infusion demonstrate state-of-the-art performance, confirming the method's scalability and adaptability to diverse image editing scenarios.
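
The stage-wise schedule described above can be illustrated with a minimal sketch. It only shows how denoising steps might be split between the two injection stages and is not the authors' implementation. The function name `injection_schedule` and the `shape_fraction` parameter are hypothetical, and the 0.4 split is an arbitrary illustrative value, not one reported in the paper.

```python
from typing import List


def injection_schedule(num_steps: int, shape_fraction: float = 0.4) -> List[str]:
    """Label each denoising step as 'shape' or 'attribute'.

    Index 0 is the first (noisiest) denoising step. `shape_fraction` is a
    hypothetical hyperparameter: the share of early steps given to shape
    injection. The remaining later steps get attribute injection.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= shape_fraction <= 1.0:
        raise ValueError("shape_fraction must lie in [0, 1]")
    # Steps before the boundary preserve geometry; steps after it transfer
    # fine-grained attributes from the reference.
    boundary = round(num_steps * shape_fraction)
    return ["shape" if i < boundary else "attribute" for i in range(num_steps)]


# Example: a 10-step sampler with the first 40% of steps used for shape injection.
schedule = injection_schedule(10, shape_fraction=0.4)
```

In a full pipeline, each label would select which operation runs at that step: a "shape" step would inject source-derived latents, while an "attribute" step would apply cross-attention against the reference latents. Where the boundary should fall is left open here, since the abstract does not specify it.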
