Identity-preserving Distillation Sampling by Fixed-Point Iterator

SeonHwa Kim, Jiwon Kim, Soobin Park, Donghoon Ahn, Jiwon Kang, Seungryong Kim, Kyong Hwan Jin, Eunju Cha

TL;DR

We address the blurriness and identity drift in Score Distillation Sampling (SDS) for text-guided editing by introducing Identity-preserving Distillation Sampling (IDS) with Fixed-point Iterative Regularization (FPR). IDS explicitly corrects the text-conditioned score toward the source identity by refining the posterior mean via Tweedie’s formula and using guided noise from a re-estimated source latent, enabling stable, structure-preserving edits in both 2D images and editable NeRF. Empirical results show IDS with FPR outperforms baselines (DDS, CDS, P2P, PnP) on 2D image editing metrics (LPIPS, IoU, PSNR) and CLIP-based NeRF evaluations, with ablations underscoring the importance of FPR iterations and scale. This approach offers a practical, modular regularization for diffusion-based editing, improving identity preservation while maintaining prompt fidelity, with noted limitations and avenues for extending to target-aware scoring and reduced computation.
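
For reference, the posterior mean mentioned above can be written out under the standard variance-preserving forward process. The notation here ($\bar\alpha_t$ for the cumulative noise schedule, $\epsilon_\theta$ for the text-conditioned noise predictor, $\hat{\mathbf{z}}_{0|t}$ for the estimate) is assumed for illustration and may differ from the paper's own symbols.

$$
\begin{aligned}
\mathbf{z}_t &= \sqrt{\bar\alpha_t}\,\mathbf{z}_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \\
\hat{\mathbf{z}}_{0|t} &= \mathbb{E}[\mathbf{z}_0 \mid \mathbf{z}_t] \approx \frac{\mathbf{z}_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(\mathbf{z}_t, y, t)}{\sqrt{\bar\alpha_t}}.
\end{aligned}
$$

If the predicted noise equals the true noise, $\hat{\mathbf{z}}_{0|t}$ recovers $\mathbf{z}_0$ exactly, so any gap between $\hat{\mathbf{z}}_{0|t}$ and the source latent measures the error in the text-conditioned score.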

Abstract

Score distillation sampling (SDS) is a powerful approach to text-conditioned 2D image and 3D object generation that distills knowledge from learned score functions. However, SDS often suffers from blurriness caused by noisy gradients. When SDS is applied to image editing, such degradations can be reduced by adjusting bias shifts using reference pairs, but these de-biasing techniques are still corrupted by erroneous gradients. To this end, we introduce Identity-preserving Distillation Sampling (IDS), which compensates for the gradient components that lead to undesired changes in the results. Based on the analysis that these errors come from the text-conditioned scores, we propose a new regularization technique, fixed-point iterative regularization (FPR), that modifies the score itself and thereby preserves identity, including pose and structure. Thanks to the self-correction provided by FPR, the proposed method produces clear and unambiguous representations corresponding to the given prompts in image-to-image editing and editable neural radiance fields (NeRF). Structural consistency between the source and the edited data is maintained better than with other state-of-the-art methods.
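
The self-correction idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_eps_model` is a hypothetical stand-in for a text-conditioned noise predictor, the step size and iteration count are arbitrary, the gradient treats the predictor's output as constant, and the guided noise is recovered by rearranging the standard forward process.

```python
import numpy as np


def tweedie_posterior_mean(z_t, eps_pred, alpha_bar):
    """Clean-latent estimate under z_t = sqrt(ab) * z_0 + sqrt(1 - ab) * eps."""
    return (z_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)


def toy_eps_model(z_t):
    """Hypothetical noise predictor (NOT a real diffusion model)."""
    return 0.3 * z_t + 0.1


def fixed_point_regularize(z_src, eps, alpha_bar, n_iters=3, step=0.2,
                           eps_model=toy_eps_model):
    """Nudge the noisy source latent so its posterior mean moves toward z_src.

    Returns the corrected noisy latent, the guided noise, and the residual
    norm ||z_src - z0_hat|| before each iteration plus once at the end.
    """
    a, s = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)
    z_t = a * z_src + s * eps  # forward-noised source latent
    residuals = []
    for _ in range(n_iters):
        z0_hat = tweedie_posterior_mean(z_t, eps_model(z_t), alpha_bar)
        residual = z_src - z0_hat
        residuals.append(float(np.linalg.norm(residual)))
        # Gradient of 0.5 * ||z_src - z0_hat||^2 w.r.t. z_t, with the
        # predictor output held constant (a simplifying assumption).
        grad = -residual / a
        z_t = z_t - step * grad
    z0_hat = tweedie_posterior_mean(z_t, eps_model(z_t), alpha_bar)
    residuals.append(float(np.linalg.norm(z_src - z0_hat)))
    # Guided noise: the noise that maps z_src to the corrected z_t.
    eps_star = (z_t - a * z_src) / s
    return z_t, eps_star, residuals


rng = np.random.default_rng(0)
z_src = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z_src.shape)
z_t, eps_star, residuals = fixed_point_regularize(z_src, eps, alpha_bar=0.5)
print(residuals)  # residual norms shrink as the iterations proceed
```

With this linear toy predictor each iteration contracts the residual by a constant factor, which is why a handful of iterations suffices here; a real network gives no such guarantee, and the ablations mentioned in the TL;DR are where the effect of iteration count and scale is reported.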

Paper Structure

This paper contains 29 sections, 13 equations, 16 figures, 8 tables, 1 algorithm.

Figures (16)

  • Figure 1: Trace of guided updating from source to target images using delta denoising score (DDS) and identity-preserving distillation sampling (IDS). DDS moves along a score-function gradient toward the manifold $\mathcal{M}_\mathbf{z}$ in the stochastic direction $\epsilon$. In contrast, IDS follows a direction corrected by fixed-point regularization.
  • Figure 2: Flowchart of IDS. The backbone of our algorithm employs the DDS framework [hertz2023delta] to distill the score function into a target image. Our fixed-point regularization (FPR) obtains a guided noise, $\epsilon^*$, through iterative updates using the posterior mean computed by Tweedie’s formula. When distilling the score function into a target image, the guided noise is updated while maintaining the identity of the source.
  • Figure 3: Accumulated error in DDS. $\mathbf{z}^{\text{trg}}$ is the image edited from the source image $\mathbf{z}^{\text{src}}$ with the prompt change $y^{\text{src}} \rightarrow y^{\text{trg}}$. $\mathbf{z}^{\text{src} \dagger}$ is the image inverted from $\mathbf{z}^{\text{trg}}$ with the prompt change $y^{\text{trg}} \rightarrow y^{\text{src}}$. (First row) Inversion result of DDS with timestep $t\sim\mathcal{U}(0, 0.2)$. (Second row) Inversion result of DDS with $t\sim\mathcal{U}(0, 1)$. (Third row) Inversion result of ours.
  • Figure 4: Qualitative results on the InstructPix2Pix dataset [brooks2023instructpix2pix]. Our method successfully edits the image to align with the target text prompt while preserving the structural integrity of the source image.
  • Figure 5: Qualitative results on Synthetic 360$^\circ$ and LLFF datasets. IDS outperforms the baselines by preserving the structural consistency of the source image and maintaining the integrity of regions that should remain unchanged, while precisely editing only the areas specified by the target prompt. Furthermore, comparisons of the depth map results also highlight the superior consistency of our method over other baseline models.
  • ...and 11 more figures