d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

TL;DR

This work tackles sketch-to-image translation under structural constraints by leveraging a pretrained latent diffusion model (LDM) without retraining it. It introduces a lightweight Latent Code Translation Network (LCTN) that maps edge-map features into the LDM’s latent space, enabling high-fidelity, photorealistic image synthesis from rough sketches. A key element is the sampling strategy, which starts from an intermediate latent $z_k$ (with $k/T$ roughly between 0.7 and 0.9) and applies $T$ denoising steps to preserve structure while enhancing realism. Experiments across multiple datasets show improved perceptual quality, structural fidelity, and generalization to unseen categories. Text prompts give practical control over visual style while the sketch continues to determine shape.
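To make the notion of an intermediate latent $z_k$ concrete, the following is a minimal sketch of the standard DDPM forward (noising) process, which produces $z_k$ from a clean latent $z_0$ in closed form. It is not the authors' implementation: the linear beta schedule, the value `T = 1000`, the choice `k = 0.8 T`, the toy four-element latent, and all function names are assumptions made for illustration only. The LCTN itself and the denoising loop are not shown.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alpha_bar(betas):
    """Running products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def noise_to_step(z0, k, alpha_bars, rng):
    """Draw z_k ~ N(sqrt(abar_k) * z0, (1 - abar_k) * I); k is 1-indexed."""
    a = alpha_bars[k - 1]
    return [
        math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for z in z0
    ]


T = 1000                      # assumed number of diffusion steps
k = round(0.8 * T)            # assumed intermediate step, k/T = 0.8
alpha_bars = cumulative_alpha_bar(linear_beta_schedule(T))

rng = random.Random(0)        # fixed seed for reproducibility
z0 = [1.0, -0.5, 0.25, 0.0]   # toy stand-in for a flattened latent code
zk = noise_to_step(z0, k, alpha_bars, rng)

# Fraction of the clean signal amplitude that survives at step k.
signal_fraction = math.sqrt(alpha_bars[k - 1])
print(f"k = {k}, signal fraction = {signal_fraction:.4f}")
```

Under this assumed schedule, only a small fraction of the clean latent's amplitude remains at $k/T = 0.8$, so most of the fine detail still has to be synthesized by the diffusion model during denoising.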

Abstract

Structural guidance in image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computational costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from the source to the target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.

Paper Structure

This paper contains 7 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Structural ambiguity in hand-drawn sketches. (a) Subject image. (b)–(g) Freehand sketches drawn by different users. The examples are from the Sketchy dataset [sangkloy2016sketchy].
  • Figure 2: Proposed training strategy for the Latent Code Translation Network (LCTN).
  • Figure 3: Proposed sampling strategy for the Latent Code Translation Network (LCTN).
  • Figure 4: Qualitative comparison of the proposed method with existing sketch-to-image translation techniques (Pix2Pix [isola2017image], CycleGAN [zhu2017unpaired], AODA [xiang2022adversarial], and LGP [voynov2023sketch]) on the Scribble [ghosh2019interactive] and QMUL [song2017deep, yu2016sketch] datasets.
  • Figure 5: Qualitative comparison for distinct object classes with nearly identical shapes. The proposed method can produce high-quality, visually distinguishable objects in contrast to the ambiguous results generated by existing sketch-to-image translation techniques (Pix2Pix [isola2017image], CycleGAN [zhu2017unpaired], AODA [xiang2022adversarial], and LGP [voynov2023sketch]).
  • ...and 2 more figures