Wavelet-Guided Acceleration of Text Inversion in Diffusion-Based Image Editing

Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo

TL;DR

This work tackles the latency of Null-text Inversion (NTI) in diffusion-based image editing by introducing the WaveOpt-Estimator, which uses wavelet-based frequency analysis to predict the endpoint $t^*$ of text optimization and restricts optimization to the corresponding DDIM timesteps. Combined with Negative-Prompt Inversion (NPI), which initializes the optimization with a target prompt, the method achieves comparable edit quality while reducing processing time by over $80\%$. The estimator fuses image and wavelet features through cross-attention with DDIM latents to regress the remaining optimization horizon, and is trained with a dual loss that balances endpoint accuracy and PSNR-based fidelity. Empirically, the approach predicts $t^*$ accurately (MAE of about $2.9$ timesteps) and improves editing speed in both NTI and NPI settings.
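
The effect of stopping text optimization at a predicted endpoint can be illustrated with a small counting sketch. It is not the authors' implementation: the function names, the 50-step schedule, the 10 inner iterations per step, and the example endpoint of 15 are all assumptions chosen for illustration, and the sketch assumes optimization runs for the first `t_star` sampling steps and is skipped afterward. It counts optimizer iterations only, which is not the same as wall-clock time.

```python
def optimization_schedule(num_steps, t_star, inner_iters):
    """Return the number of text-optimization iterations spent at each step.

    Hypothetical helper: steps with index < t_star are optimized with
    `inner_iters` iterations each; later steps are not optimized at all.
    """
    if not 0 <= t_star <= num_steps:
        raise ValueError("t_star must lie in [0, num_steps]")
    return [inner_iters if step < t_star else 0 for step in range(num_steps)]


def fraction_saved(schedule, inner_iters):
    """Fraction of optimizer iterations avoided relative to optimizing every step."""
    full_cost = inner_iters * len(schedule)
    return 1.0 - sum(schedule) / full_cost


# Illustrative values only (assumed, not taken from the paper).
NUM_STEPS = 50
INNER_ITERS = 10
T_STAR = 15

schedule = optimization_schedule(NUM_STEPS, T_STAR, INNER_ITERS)
saved = fraction_saved(schedule, INNER_ITERS)
print(f"iterations used: {sum(schedule)}, fraction saved: {saved:.2f}")
```

Under these assumed values, the iteration count scales linearly with the endpoint, so an estimator that predicts a smaller endpoint for a given image directly shortens the optimization loop for that image.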

Abstract

In the field of image editing, Null-text Inversion (NTI) enables fine-grained editing while preserving the structure of the original image by optimizing null embeddings during the DDIM sampling process. However, the NTI process is time-consuming, taking more than two minutes per image. To address this, we introduce an innovative method that maintains the principles of NTI while accelerating the image editing process. We propose the WaveOpt-Estimator, which determines the text optimization endpoint based on frequency characteristics. Utilizing wavelet transform analysis to identify the image's frequency characteristics, we can limit text optimization to specific timesteps during the DDIM sampling process. By adopting the Negative-Prompt Inversion (NPI) concept, a target prompt representing the original image serves as the initial text value for optimization. This approach maintains performance comparable to NTI while reducing the average editing time by over 80%. Our method presents a promising approach for efficient, high-quality image editing based on diffusion models.
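
As a rough illustration of what "frequency characteristics" from a wavelet transform can look like, the sketch below computes a single-level 2D Haar decomposition with NumPy and reports how the image energy is split across the four subbands. This is a minimal sketch, not the paper's feature extractor: the choice of the Haar wavelet, the single decomposition level, the function names, and the subband labels are assumptions (the LH/HL naming convention in particular varies between libraries).

```python
import numpy as np


def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D array."""
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")
    s = np.sqrt(2.0)
    # Combine vertically adjacent rows: low-pass (sum) and high-pass (difference).
    lo = (x[0::2, :] + x[1::2, :]) / s
    hi = (x[0::2, :] - x[1::2, :]) / s
    # Combine horizontally adjacent columns of each intermediate result.
    return {
        "LL": (lo[:, 0::2] + lo[:, 1::2]) / s,  # coarse approximation
        "LH": (lo[:, 0::2] - lo[:, 1::2]) / s,  # detail across columns
        "HL": (hi[:, 0::2] + hi[:, 1::2]) / s,  # detail across rows
        "HH": (hi[:, 0::2] - hi[:, 1::2]) / s,  # diagonal detail
    }


def subband_energy_fractions(image):
    """Share of total squared-coefficient energy held by each subband."""
    bands = haar_dwt2(image)
    energy = {name: float(np.sum(band ** 2)) for name, band in bands.items()}
    total = sum(energy.values())
    if total == 0.0:
        return {name: 0.0 for name in energy}
    return {name: value / total for name, value in energy.items()}


# A flat image has no detail; a pixel-level checkerboard is pure diagonal detail.
flat = np.ones((4, 4))
checker = np.indices((4, 4)).sum(axis=0) % 2 * 2.0 - 1.0
flat_fractions = subband_energy_fractions(flat)
checker_fractions = subband_energy_fractions(checker)
print(flat_fractions)
print(checker_fractions)
```

Because this transform is orthonormal, the subband energies sum to the energy of the input, so the fractions form a compact descriptor of how much of an image is smooth structure versus fine detail. How the WaveOpt-Estimator actually consumes such features is described in the paper, not here.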

Paper Structure (11 sections, 6 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Results of DDIM Sampling According to $t^*$
  • Figure 2: Results of PSNR between $x_{ori}$ and $x^{i^*}_{recon}$
  • Figure 3: Wavelet subbands and their energy
  • Figure 4: Overview of our proposed WaveOpt-Estimator model
  • Figure 5: Comparison of reconstruction and edit results with other methods