Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training

Ximing Xing, Chuang Wang, Haitao Zhou, Zhihao Hu, Chongxuan Li, Dong Xu, Qian Yu

TL;DR

This work tackles the problem of generating photo-realistic images from sketches while faithfully transferring appearance from exemplars. It introduces a training-free two-stage diffusion-based framework, Inversion-by-Inversion, comprising shape-enhancing inversion to enforce sketch geometry and full-control inversion to graft exemplar appearance, guided by a shape-energy and an appearance-energy function. The method achieves state-of-the-art quantitative and qualitative results on exemplar-based sketch-to-photo tasks, demonstrating robustness to style exemplars, stroke inputs, and freehand sketches without task-specific retraining. The approach promises practical impact for controllable image synthesis in AIGC applications by enabling flexible, exemplar-guided editing directly from sketches.

Abstract

Exemplar-based sketch-to-photo synthesis allows users to generate photo-realistic images based on sketches. Recently, diffusion-based methods have achieved impressive performance on image generation tasks, enabling highly flexible control through text-driven generation or energy functions. However, generating photo-realistic images with color and texture from sketch images remains challenging for diffusion models. Sketches typically consist of only a few strokes, with most regions left blank, making it difficult for diffusion-based methods to produce photo-realistic images. In this work, we propose a two-stage method named "Inversion-by-Inversion" for exemplar-based sketch-to-photo synthesis. This approach includes shape-enhancing inversion and full-control inversion. During the shape-enhancing inversion process, an uncolored photo is generated with the guidance of a shape-energy function. This step is essential to ensure control over the shape of the generated photo. In the full-control inversion process, we propose an appearance-energy function to control the color and texture of the final generated photo. Importantly, our Inversion-by-Inversion pipeline is training-free and can accept different types of exemplars for color and texture control. We conducted extensive experiments to evaluate our proposed method, and the results demonstrate its effectiveness. The code and project can be found at https://ximinng.github.io/inversion-by-inversion-project/.
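
The two-stage structure described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: the score function, both energy gradients, the constant noise rate `BETA`, the guidance weights, and all function names below are hypothetical stand-ins chosen only so the example runs without a pretrained model. It assumes a generic energy-guided reverse SDE in which the weighted energy gradients are subtracted from the score.

```python
import numpy as np

BETA = 1.0  # constant noise rate of a toy VP-SDE (assumption, not from the paper)

def toy_score(x, t):
    """Placeholder for a pretrained score network; here the score of N(0, I)."""
    return -x

def shape_energy_grad(x, sketch):
    """Gradient of the stand-in shape energy 0.5 * ||x - sketch||^2."""
    return x - sketch

def appearance_energy_grad(x, exemplar):
    """Gradient of the stand-in energy 0.5 * ||mean_color(x) - mean_color(exemplar)||^2."""
    h, w, _ = x.shape
    diff = x.mean(axis=(0, 1)) - exemplar.mean(axis=(0, 1))
    return np.broadcast_to(diff / (h * w), x.shape)

def guided_inversion(x0, guides, t0=0.5, steps=50, rng=None):
    """Perturb x0 to time t0, then integrate an energy-guided reverse SDE back to t=0.

    guides is a list of (weight, grad_fn) pairs, where grad_fn(x) has the shape of x.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Forward perturbation: closed-form VP-SDE marginal at time t0.
    mean_scale = np.exp(-0.5 * BETA * t0)
    noise_scale = np.sqrt(1.0 - np.exp(-BETA * t0))
    x = mean_scale * x0 + noise_scale * rng.standard_normal(x0.shape)
    dt = t0 / steps
    for i in range(steps):
        t = t0 - i * dt
        guidance = sum(w * g(x) for w, g in guides)
        # Reverse-time Euler-Maruyama step with the guided score (score - guidance).
        drift = 0.5 * BETA * x + BETA * (toy_score(x, t) - guidance)
        x = x + drift * dt + np.sqrt(BETA * dt) * rng.standard_normal(x.shape)
    return x

def inversion_by_inversion(sketch, exemplar, rng):
    """Stage 1: shape guidance only. Stage 2: shape plus appearance guidance."""
    shape_guides = [(1.0, lambda x: shape_energy_grad(x, sketch))]
    uncolored = guided_inversion(sketch, shape_guides, rng=rng)
    full_guides = shape_guides + [(100.0, lambda x: appearance_energy_grad(x, exemplar))]
    return guided_inversion(uncolored, full_guides, rng=rng)

rng = np.random.default_rng(0)
sketch = np.ones((8, 8, 3))
sketch[2:6, 4] = 0.0            # a single dark stroke on a white canvas
exemplar = rng.random((8, 8, 3))
photo = inversion_by_inversion(sketch, exemplar, rng)
print(photo.shape)              # (8, 8, 3)
```

The point of the sketch is the control flow: the output of the first guided inversion becomes the starting image of the second, and the second stage keeps the shape guidance while adding the appearance guidance.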

Paper Structure (25 sections, 9 equations, 12 figures, 2 tables, 1 algorithm)

Figures (12)

  • Figure 1: Representative exemplar-based sketch-to-photo synthesis results of our proposed Inversion-by-Inversion method. The first four columns show sketch-to-photo synthesis results on the AFHQ dataset [StarGAN_Choi_2020] and the CelebA-HQ dataset [karras2017progressive]; the original photos are displayed in the top-right corner of their corresponding sketches. The fifth to eighth columns show the synthesis results when using different types of exemplars. The last three columns show three examples that use free-hand sketches as input; these sketches are from the Sketchy database [sketchy].
  • Figure 2: Illustration of our Inversion-by-Inversion translation via SDE. A large domain gap exists between sketches and photos, making it hard to generate the final output photo directly from a sketch. Therefore, we propose an Inversion-by-Inversion sketch-to-image translation method, which involves two stages of inversion, namely the shape-enhancing inversion and the full-control inversion. We first generate intermediate uncolored photos from sketches using our shape-enhancing inversion and then generate the final results using our full-control inversion.
  • Figure 3: Overview of our Inversion-by-Inversion translation via SDE. The blue, green, and orange contour plots represent the distributions of sketches, uncolored photos, and photos, respectively. The movement of the grey dot across the distributions denotes the sketch-to-photo synthesis process of our proposed Inversion-by-Inversion method. In our shape-enhancing inversion step (a), we first perturb the input sketch with the forward process of the SDE. The inversion process of the SDE then gradually removes the noise, and the uncolored photo is synthesized with the shape of the input sketch. During this procedure, we use the proposed shape-energy function to maintain the structure of the input sketch. After that, we perform the full-control inversion step (b) by first perturbing the uncolored photo and then using SDE inversion to denoise it. During this procedure, we use both the shape-energy function and the appearance-energy function to maintain the structure of the input sketch and to add the appearance (i.e., texture and color) of the given exemplar to the output photo. Best viewed in color.
  • Figure 4: Comparison of the results of different methods on Cat $\rightarrow$ Dog. Because ILVR, SDEdit, and EGSDE cannot take two images as input simultaneously, we ran two versions of the experiment for these methods: using the sketch directly as input, or mixing the sketch and exemplar into a single image and using that as input. “Mixup 1” denotes a blended image of 70% sketch and 30% exemplar, and “Mixup 2” denotes a blended image of 30% sketch and 70% exemplar. It is challenging for these methods to balance shape control against appearance control. Among all methods, our method achieves the highest visual quality and faithfulness.
  • Figure 5: Comparison of the results when using freehand sketches for shape control. The sketches are sampled from the Sketchy dataset [sketchy]. Because ILVR, SDEdit, and EGSDE cannot take the sketch and exemplar as input simultaneously, we compare against them using the combined images of sketches and exemplars (“Mixup 1”) as their input.
  • ...and 7 more figures
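
The "Mixup" baseline inputs described in the Figure 4 caption can be illustrated with a short sketch. It assumes that "blended" means a pixel-wise linear combination of two same-sized images, which the caption does not state explicitly; the function name `mixup` and the toy arrays are hypothetical.

```python
import numpy as np

def mixup(sketch, exemplar, sketch_weight):
    """Pixel-wise blend: sketch_weight * sketch + (1 - sketch_weight) * exemplar."""
    if not 0.0 <= sketch_weight <= 1.0:
        raise ValueError("sketch_weight must lie in [0, 1]")
    if sketch.shape != exemplar.shape:
        raise ValueError("sketch and exemplar must have the same shape")
    return sketch_weight * sketch + (1.0 - sketch_weight) * exemplar

# Toy inputs: an all-white "sketch" and an all-black "exemplar" with values in [0, 1].
sketch = np.ones((4, 4, 3))
exemplar = np.zeros((4, 4, 3))

mixup_1 = mixup(sketch, exemplar, 0.7)  # 70% sketch, 30% exemplar
mixup_2 = mixup(sketch, exemplar, 0.3)  # 30% sketch, 70% exemplar
```

Under this assumption a single weight trades shape information against appearance information, which is consistent with the caption's remark that the baselines struggle to balance the two.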