WINE: Wavelet-Guided GAN Inversion and Editing for High-Fidelity Refinement

Chaewon Kim, Seung-Jun Moon, Gyeong-Moon Park

TL;DR

WINE tackles the persistent low-frequency bias in GAN inversion by introducing a frequency-domain approach that explicitly preserves high-frequency details. It combines a wavelet loss targeting high-frequency subbands with a wavelet fusion mechanism to transfer high-frequency information into the generator, enabling high-fidelity inversion and robust editing. The method demonstrates superior reconstruction quality and editability over state-of-the-art baselines across multiple datasets, supported by ablations and theoretical insights into sub-band information content. This frequency-aware framework has practical implications for more accurate image restoration and editing in GAN-based pipelines, and may generalize to other wavelet-augmented generators. The work thus advances high-fidelity inversion and editing by bridging spatial and spectral information through wavelet analysis.

Abstract

Recent advanced GAN inversion models aim to convey high-fidelity information from original images to generators through generator tuning or high-dimensional feature learning. Despite these efforts, accurately reconstructing image-specific details remains a challenge because of inherent limitations in both training and model structure, which lead to a bias towards low-frequency information. In this paper, we examine the widely used pixel loss in GAN inversion and show that it focuses predominantly on the reconstruction of low-frequency features. We then propose WINE, a Wavelet-guided GAN Inversion aNd Editing model, which transfers high-frequency information through wavelet coefficients via a newly proposed wavelet loss and wavelet fusion scheme. Notably, WINE is the first attempt to interpret GAN inversion in the frequency domain. Our experimental results showcase the precision of WINE in preserving high-frequency details and enhancing image quality. Even in editing scenarios, WINE outperforms existing state-of-the-art GAN inversion models, striking a fine balance between editability and reconstruction quality.
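
The claim about pixel loss can be made concrete with a standard identity. This is a sketch under the assumption of an orthonormal single-level 2D wavelet transform $W$ (e.g., Haar) with subbands $f \in \{LL, LH, HL, HH\}$; the symbols $W$ and $W_f$ are introduced here for illustration, and this is not necessarily the derivation given in the paper. With $X$ the original image and $\hat{X}$ its inversion, energy preservation gives

$$
\|X - \hat{X}\|_2^2 = \sum_{f \in \{LL, LH, HL, HH\}} \|W_f(X) - W_f(\hat{X})\|_2^2 .
$$

When the $LL$ term is far larger than the other three, minimizing the pixel loss mostly reduces the low-frequency error, and the high-frequency terms contribute comparatively little to the gradient.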

Paper Structure (27 sections, 30 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Comparison of Recent GAN Inversion Models. We evaluate recent high-fidelity GAN inversion models on the inversion of an intricate image. Regions that demand meticulous preservation of detail, such as letters, lip shape, and pupils, are closely examined. Even with high-rate inversion via residual learning, existing baselines have difficulty restoring these nuanced details. In contrast, our newly introduced WINE method robustly preserves such intricate details.
  • Figure 2: (a) Wavelet Transform. We plot the wavelet coefficients produced by each filter at the first level of wavelet decomposition. Gray denotes a value of zero. The coefficients from $LH$, $HL$, and $HH$ are significantly sparser than those from $LL$. (b) Comparison of $\mathcal{L}_{2}$ from Each Filter between Baseline Methods. We plot the average $\mathcal{L}_{2}$ of each wavelet coefficient between CelebA-HQ test images and the corresponding images inverted by various state-of-the-art inversion models. Because of the large gap between $\mathcal{L}_{2,LL}$ and the rest ($\sim 30 \times$ on a linear scale), we display the losses on a logarithmic scale. (c) Comparison of Image Reconstruction Quality between Backbone Generators. We compare the single-image reconstruction ability of StyleGAN2 and SWAGAN in both visual and spectral terms. Because SWAGAN generates images in the spatial frequency domain, the generated image preserves high-frequency information.
  • Figure 3: Training Scheme of WINE. Given a pre-trained encoder $E_0$ and generator $G_0$, we obtain an initial inverted image $\hat{X}_0$. The residual $\Delta$ contains the high-fidelity details that $\hat{X}_0$ misses. The model uses a trainable Adaptive Distortion Alignment ($ADA$) module to align the residual so that, at inference, it ultimately aligns with $\hat{X}_0$ or the edited image $\hat{X}_0^{edit}$. From the aligned $\hat{\Delta}$, the two fusion modules $F_{feat}$ and $F_{wave}$ replenish the missing high-fidelity information. The outputs of these modules are fused in the feature domain and the frequency domain, respectively, at separate intermediate layers. The final inversion result $\hat{X}$ contains rich information without the loss of high-frequency components. Note that $ADA$, $F_{feat}$, and $F_{wave}$ are all jointly trained, while $E_0$ and $G_0$ are frozen.
  • Figure 4: Qualitative Comparison between Inversion Results of Baselines. The baseline models, including the state-of-the-art high-rate inversion models, fail to preserve details such as accessories and complex backgrounds. In contrast, images inverted through WINE show robust reconstruction of image-specific details, e.g., eyelashes, a camera, and legible letters in the respective rows.
  • Figure 5: Qualitative Comparison between Editing Results of Baselines. We show images edited via InterFaceGAN (rows 1-8) and via StyleCLIP (rows 9-10). Both low- and high-rate inversion baselines struggle to preserve details, while our proposed method restores high-fidelity details and maintains satisfactory editability with highly disentangled editing performance.
  • ...and 6 more figures
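
The per-subband measurements described for Figure 2 (a) and (b) can be approximated with a few lines of NumPy. The following is a minimal sketch, assuming a single-level orthonormal Haar transform and random arrays as stand-ins for real and inverted images; the function names `haar_dwt2`, `subband_l2`, and `sparsity` are hypothetical, and the wavelet filter and normalization used in the paper may differ.

```python
import numpy as np


def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform of an (H, W) array.

    H and W must be even. Subband naming (LH vs. HL) differs between
    libraries; the convention used here is an assumption for illustration.
    """
    img = np.asarray(img, dtype=np.float64)
    a, b = img[0::2, 0::2], img[0::2, 1::2]  # top-left, top-right of each 2x2 block
    c, d = img[1::2, 0::2], img[1::2, 1::2]  # bottom-left, bottom-right
    return {
        "LL": (a + b + c + d) / 2.0,  # local average (low frequency)
        "LH": (a + b - c - d) / 2.0,  # difference between the two rows
        "HL": (a - b + c - d) / 2.0,  # difference between the two columns
        "HH": (a - b - c + d) / 2.0,  # diagonal difference
    }


def subband_l2(x, x_hat):
    """Mean squared error between two images, computed per Haar subband."""
    wx, wy = haar_dwt2(x), haar_dwt2(x_hat)
    return {k: float(np.mean((wx[k] - wy[k]) ** 2)) for k in wx}


def sparsity(coeffs, tol=1e-6):
    """Fraction of coefficients whose magnitude is below tol."""
    return float(np.mean(np.abs(coeffs) < tol))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((64, 64))                          # stand-in for a test image
    x_hat = x + 0.05 * rng.standard_normal((64, 64))  # stand-in for its inversion
    for name, err in subband_l2(x, x_hat).items():
        print(f"{name}: {err:.6f}")
```

Because this transform is orthonormal, the average of the four subband errors returned by `subband_l2` equals the ordinary pixel-wise mean squared error, so the per-subband values show how the pixel loss is distributed across frequency bands.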
