LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing

Yazeed Alharbi, Peter Wonka

TL;DR

LASPA addresses the challenge of fast, training-free single-image editing with diffusion models by introducing latent spatial alignment that leverages a real reference image to guide the reverse diffusion while keeping textual prompts fixed. It presents three primary alignment strategies—input, epsilon, and predicted $x_0$ alignment—plus semantic latent mixing guided by attention maps, enabling edits that preserve input details and achieve strong editing fidelity without finetuning or extra per-image storage. Across qualitative and quantitative evaluations, LASPA outperforms finetuning-based methods (e.g., Imagic, SINE) and inference-time baselines (e.g., SDEdit, DiffEdit), delivering edits in under 6 seconds with high perceptual similarity to the input. The approach is hardware-friendly and scalable to higher resolutions and video, making rapid, mobile- and cloud-friendly diffusion-based editing feasible with minimal storage and computation.

Abstract

We present a novel, training-free approach for textual editing of real images using diffusion models. Unlike prior methods that rely on computationally expensive finetuning, our approach leverages LAtent SPatial Alignment (LASPA) to efficiently preserve image details. We demonstrate how the diffusion process is amenable to spatial guidance using a reference image, leading to semantically coherent edits. This eliminates the need for complex optimization and costly model finetuning, resulting in significantly faster editing compared to previous methods. Additionally, our method avoids the storage requirements associated with large finetuned models. These advantages make our approach particularly well-suited for editing on mobile devices and applications demanding rapid response times. While simple and fast, our method achieves 62-71% preference in a user study and significantly better model-based editing strength and image preservation scores.
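To make the idea of spatial guidance from a reference image more concrete, the following is a minimal sketch of what a predicted $x_0$ alignment step could look like. It is not the paper's implementation. The only part taken as given is the standard DDPM identity that recovers a clean-latent estimate from a noisy latent and a noise prediction. The linear blend in `align_x0`, the `weight` parameter, and all function names are assumptions made for illustration; the exact alignment rule and schedule used by LASPA are not stated in this summary.

```python
import math


def renoise(x0, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [s * x + n * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps, alpha_bar):
    """Standard DDPM identity: x0_hat = (x_t - sqrt(1 - a) * eps) / sqrt(a)."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - n * e) / s for x, e in zip(x_t, eps)]


def align_x0(x0_hat, x0_ref, weight):
    """Hypothetical alignment rule (an assumption, not from the paper):
    pull the predicted clean latent toward the reference latent by a
    linear blend. weight=0 keeps the prediction, weight=1 copies the
    reference."""
    return [(1.0 - weight) * p + weight * r for p, r in zip(x0_hat, x0_ref)]


# Toy flattened "latents" standing in for real latent tensors.
x0_ref = [3.0, 0.0]     # latent of the real reference image
x0_true = [1.0, -2.0]   # latent the sampler is currently heading toward
eps = [0.5, 0.25]       # noise prediction from the (prompt-conditioned) model
alpha_bar = 0.64        # cumulative noise-schedule value at this step

x_t = renoise(x0_true, eps, alpha_bar)
x0_hat = predict_x0(x_t, eps, alpha_bar)       # recovers x0_true
x0_aligned = align_x0(x0_hat, x0_ref, 0.25)    # nudged toward the reference
x_t_aligned = renoise(x0_aligned, eps, alpha_bar)
```

In this sketch a larger `weight` preserves more of the input image and a smaller one leaves more room for the editing prompt. How that trade-off is actually scheduled across reverse diffusion steps is a design choice this example does not attempt to reproduce.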

Paper Structure (27 sections, 10 equations, 17 figures, 1 table)

This paper contains 27 sections, 10 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: Using a single real image as input, our method is capable of editing using textual prompts in less than 6 seconds without finetuning the diffusion model or using costly image embedding algorithms. Our results show accurate editing for realistic as well as artistic edits.
  • Figure 2: An illustration of our alignment methods and how they influence reverse diffusion steps.
  • Figure 3: Our latent alignment smoothly incorporates information from both the editing prompt and the input image. The editing prompt used is "a photo of an angry woman."
  • Figure 4: Our proposed alignment methods lead to accurate attention maps and can be used for background editing using semantic latent mixing.
  • Figure 5: A visual comparison of editing results between state-of-the-art methods and our method (LASPA). Our method exceeds the quality of methods that require finetuning, while editing only at inference time.
  • ...and 12 more figures