SINE: SINgle Image Editing with Text-to-Image Diffusion Models

Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, Jian Ren

TL;DR

SINE tackles single-image editing by fine-tuning a pre-trained text-to-image diffusion model on the one available image and then, at sampling time, distilling the fine-tuned model's knowledge into the pre-trained model through a model-based guidance built on classifier-free guidance. A patch-based fine-tuning scheme enables outputs at arbitrary resolution. The approach preserves content and geometry while allowing language-guided edits, addressing the overfitting and language drift that plague single-image fine-tuning. Experiments show high-fidelity edits, good text alignment, and successful high-resolution generation, comparing favorably with baselines such as DreamBooth and Textual Inversion. The work extends diffusion models to editing unique images (e.g., paintings, sculptures), supporting style changes, content addition, and object manipulation.

Abstract

Recent works on diffusion models have demonstrated a strong capability for conditioning image generation, e.g., text-guided image synthesis. This success has inspired many efforts to use large-scale pre-trained diffusion models for a challenging problem: real image editing. Works in this area learn a unique textual token corresponding to several images containing the same object. However, in many circumstances only one image is available, such as the painting Girl with a Pearl Earring. Applying existing methods to fine-tune a pre-trained diffusion model on a single image causes severe overfitting. Information leakage from the pre-trained diffusion model then prevents the edit from keeping the content of the given image while creating the new features described by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon classifier-free guidance, so that the knowledge from the model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with one given image. Additionally, we propose a patch-based fine-tuning that effectively helps the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including changing style, content addition, and object manipulation. The code is available for research purposes at https://github.com/zhang-zx/SINE.git .

Paper Structure (16 sections, 3 equations, 23 figures, 1 table)

Figures (23)

  • Figure 1: With only one real image, i.e., Source Image, our method is able to manipulate and generate the content in various ways, such as changing style, adding context, modifying the object, and enlarging the resolution, through guidance from the text prompt.
  • Figure 2: Overview of our method. (a) Given a source image, we first randomly crop it into patches and get the corresponding latent code $\mathbf{z}$ with the pre-trained encoder. At fine-tune time, the denoising model, $\boldsymbol{\epsilon}_\theta$, takes three inputs: noisy latent $\mathbf{z}_T$, language condition $\mathbf{c}$, and positional embedding for the area where the noisy latent is obtained. (b) During sampling, we give additional language guidance about the target domain to edit the image. We also sample a noisy latent code $\mathbf{z}_T$ whose dimensions correspond to the desired output resolution. The language conditioning $\mathbf{c}$ for the pre-trained model $\boldsymbol{\epsilon}_\theta$ is given by the pre-trained language encoder $\boldsymbol{\tau}_\theta$ applied to the target language guidance. The fine-tuned diffusion model, $\boldsymbol{\hat{\epsilon}}_\theta$, receives its language conditioning $\mathbf{\hat{c}}$ together with the positional embedding for the whole image. We linearly combine the scores computed by the two models for the first $K$ steps and run inference only on the pre-trained $\boldsymbol{\epsilon}_\theta$ afterwards.
  • Figure 3: Editing a single source image from various domains. We apply our method to various images and edit each with two target prompts at $512\times512$ resolution. We show the wide range of edits our approach can be used for, including but not limited to style transfer, content addition, posture change, and breed change.
  • Figure 4: Arbitrary resolution editing. Our method achieves higher-resolution image editing without artifacts such as duplicates, even for outputs whose height-width ratio changes drastically.
  • Figure 5: Comparisons of various methods. We compare our method to DreamBooth (Ruiz et al., 2022) and Textual Inversion (Gal et al., 2022). On the left part of the figure, we edit at the same resolution as at training time. On the right part, we edit the source image at a higher resolution. Our work successfully edits the image as required while preserving the details of the source images. We also compare our method without and with the patch-based fine-tuning mechanism (w/o pos vs. w/ pos). When editing at a fixed resolution, the two settings perform equally, while at a higher resolution, the patch-based fine-tuning method successfully prevents artifacts.
  • ...and 18 more figures
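
The two mechanisms described in the Figure 2 caption, random patch cropping with a positional input and the combination of scores from the two models during the first $K$ sampling steps, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `min_frac` bound on patch size, the use of a normalized box as the patch position, the mixing weight `v`, the guidance scale `w`, and their default values are all hypothetical, because this section gives neither the exact formula nor the positional encoding. The final line of `guided_noise` is the standard classifier-free guidance update.

```python
import random


def random_patch_box(height, width, min_frac=0.5, rng=random):
    """Pick a random crop and its normalized position in the source image.

    Returns (box, pos): box is (top, left, bottom, right) in pixels and
    pos holds the same corners scaled to [0, 1]. Using this normalized box
    as the positional input is an assumption of this sketch.
    """
    patch_h = rng.randint(max(1, int(min_frac * height)), height)
    patch_w = rng.randint(max(1, int(min_frac * width)), width)
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)
    box = (top, left, top + patch_h, left + patch_w)
    pos = (box[0] / height, box[1] / width, box[2] / height, box[3] / width)
    return box, pos


# Position given to the fine-tuned model at sampling time: the whole image.
FULL_IMAGE_POS = (0.0, 0.0, 1.0, 1.0)


def guided_noise(eps_pre_cond, eps_pre_uncond, eps_ft_cond,
                 step, k_steps, v=0.7, w=7.5):
    """Combine noise predictions for one sampling step.

    eps_pre_cond   : pre-trained model, conditioned on the target prompt
    eps_pre_uncond : pre-trained model, unconditional
    eps_ft_cond    : fine-tuned model, conditioned on its own prompt
    step           : 0-based index of the current sampling step
    k_steps        : number of initial steps that use the fine-tuned model
    v, w           : assumed mixing weight and guidance scale
    """
    if step < k_steps:
        # First K steps: linear combination of the two conditional scores.
        cond = v * eps_ft_cond + (1.0 - v) * eps_pre_cond
    else:
        # Remaining steps: pre-trained model only.
        cond = eps_pre_cond
    # Standard classifier-free guidance around the unconditional prediction.
    return eps_pre_uncond + w * (cond - eps_pre_uncond)
```

Because `guided_noise` uses only arithmetic operators, the same function works on plain floats, NumPy arrays, or PyTorch tensors. Setting `k_steps=0` reduces it to ordinary classifier-free guidance on the pre-trained model, which makes the contribution of the fine-tuned model easy to isolate when experimenting.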