LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing
Yazeed Alharbi, Peter Wonka
TL;DR
LASPA addresses the challenge of fast, training-free single-image editing with diffusion models by introducing latent spatial alignment, which uses a real reference image to guide the reverse diffusion while keeping textual prompts fixed. It presents three primary alignment strategies (input, epsilon, and predicted $x_0$ alignment) plus semantic latent mixing guided by attention maps, enabling edits that preserve input details and achieve strong editing fidelity without finetuning or extra per-image storage. Across qualitative and quantitative evaluations, LASPA outperforms finetuning-based methods (e.g., Imagic, SINE) and inference-time baselines (e.g., SDEdit, DiffEdit), delivering edits in under 6 seconds with high perceptual similarity to the input. The approach is hardware-friendly and scalable to higher resolutions and video, making rapid diffusion-based editing feasible on mobile devices and in the cloud with minimal storage and computation.
Abstract
We present a novel, training-free approach for textual editing of real images using diffusion models. Unlike prior methods that rely on computationally expensive finetuning, our approach leverages LAtent SPatial Alignment (LASPA) to efficiently preserve image details. We demonstrate how the diffusion process is amenable to spatial guidance using a reference image, leading to semantically coherent edits. This eliminates the need for complex optimization and costly model finetuning, resulting in significantly faster editing than previous methods. Additionally, our method avoids the storage requirements associated with large finetuned models. These advantages make our approach particularly well-suited for editing on mobile devices and for applications demanding rapid response times. While simple and fast, our method achieves 62-71% preference in a user study and significantly better model-based editing strength and image preservation scores.
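To make the idea of spatial guidance from a reference image concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reference latent is noised to the current timestep with the standard DDPM forward process and then linearly blended into the current latent. The function names, the flat-list latent representation, and the blending `weight` are hypothetical choices made for illustration only.

```python
import math
import random


def forward_noise(x0, alpha_bar_t, eps):
    """Standard DDPM forward process, elementwise:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def align_latent(x_t, ref_x0, alpha_bar_t, weight, rng):
    """Hypothetical alignment step (an assumption, not the paper's rule):
    noise the clean reference latent to the current noise level, then
    pull the current latent toward it by a linear blend."""
    eps = [rng.gauss(0.0, 1.0) for _ in ref_x0]
    ref_t = forward_noise(ref_x0, alpha_bar_t, eps)
    return [(1.0 - weight) * x + weight * r for x, r in zip(x_t, ref_t)]


# Toy usage with flattened 3-element "latents".
rng = random.Random(0)
reference = [0.5, -1.0, 2.0]   # clean latent of the real input image
current = [1.0, 1.0, 1.0]      # latent being denoised under the edit prompt
aligned = align_latent(current, reference, alpha_bar_t=0.5, weight=0.3, rng=rng)
```

With `weight=0` the step leaves the latent untouched, and larger weights trade editing freedom for closer adherence to the reference. In a real pipeline a step like this would run inside the sampler loop on tensors rather than lists.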
