SketchingReality: From Freehand Scene Sketches To Photorealistic Images

Ahmed Bourouis, Mikhail Bessmeltsev, Yulia Gryaditskaya

TL;DR

The paper proposes a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions, and shows that it outperforms existing approaches both in semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.

Abstract

Recent years have witnessed remarkable progress in generative AI, with natural language emerging as the most common conditioning input. As underlying models grow more powerful, researchers are exploring increasingly diverse conditioning signals, such as depth maps, edge maps, camera parameters, and reference images, to give users finer control over generation. Among different modalities, sketches are a natural and long-standing form of human communication, enabling rapid expression of visual concepts. Previous literature has largely focused on edge maps, often misnamed 'sketches', yet algorithms that effectively handle true freehand sketches, with their inherent abstraction and distortions, remain underexplored. We pursue the challenging goal of balancing photorealism with sketch adherence when generating images from freehand input. A key obstacle is the absence of ground-truth, pixel-aligned images: by their nature, freehand sketches do not have a single correct alignment. To address this, we propose a modulation-based approach that prioritizes semantic interpretation of the sketch over strict adherence to individual edge positions. We further introduce a novel loss that enables training on freehand sketches without requiring ground-truth pixel-aligned images. We show that our method outperforms existing approaches in both semantic alignment with freehand sketch inputs and in the realism and overall quality of the generated images.

Paper Structure

This paper contains 46 sections, 6 equations, 19 figures, 9 tables.

Figures (19)

  • Figure 1: With just a few strokes, sketches can convey complex visual concepts that are difficult to express in words, making them a natural conditioning choice for efficient, human-centered, controllable generative AI. Our method takes as input a freehand sketch together with a text prompt. The figure compares our results with state-of-the-art baselines, ControlNet [zhang2023adding] and T2I-Adapter [mou2023t2i], on freehand sketches from the FS-COCO dataset [chowdhury2022fs]. SD2.1 and SDXL denote the backbones used, [rombach2022high] and [podell2023sdxl], respectively. The first column shows the reference image presented to participants, who recreated it from memory within a limited time, simulating how humans draw from a mental image. Our approach achieves a strong balance between sketch adherence and photorealism.
  • Figure 2: Method overview. Modulation network (a). During training, the latent features $z_t$ are generated by applying a noising process to the clean latent $z_0$, which is obtained by encoding the ground-truth image with the VAE encoder. During inference, $z_t$ is instead sampled from a standard Gaussian distribution and then progressively denoised by the diffusion model. The text-conditioned diffusion model generates time-dependent noise $\epsilon_t$, which we modulate using semantic sketch features. For details, see \ref{sec:modulation_network} and \ref{tab:modulation_network}. Attention supervision (b). Attention supervision allows us to train on a combination of freehand sketches and sketches algorithmically generated from reference images. It bypasses the need for pixel-aligned ground-truth images, which are unavailable for freehand sketches, and helps the modulation network focus on sketch semantics. For details, see \ref{sec:attention}.
  • Figure 3: Visual comparison between our method and the baselines, both in a zero-shot setting (using the weights of the pretrained models) and after fine-tuning the baselines with the proposed attention loss on a mix of freehand and synthetic sketches.
  • Figure 4: Quantitative evaluation of the role of sketch representation. See \ref{sec:ablate_sketch_input} for details.
  • Figure 5: Qualitative and quantitative evaluation when we (i) remove the attention loss $\mathcal{L}_\text{attn}$ (\ref{eq:attn}), 'WO attention loss', and (ii) remove the variance loss $\mathcal{L}_\text{var}$ (\ref{eq:var_loss}), 'WO variance loss'. See \ref{sec:ablate_losses} for a detailed discussion. Captions are taken as-is from the FS-COCO dataset.
  • ...and 14 more figures
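
The Figure 2 caption distinguishes how $z_t$ is obtained during training (noising the clean latent $z_0$) and during inference (sampling from a standard Gaussian). The snippet below is a minimal sketch of that distinction under stated assumptions, not the authors' implementation: it assumes the standard DDPM forward process $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with a linear beta schedule, and the function names, schedule values, and toy latent are all hypothetical. The caption does not specify the schedule or the latent shape.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an ASSUMED linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_latent(z0, t, alpha_bar, rng):
    """Training path: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    z_t = [a * z + b * e for z, e in zip(z0, eps)]
    return z_t, eps


def sample_initial_latent(size, rng):
    """Inference path: the starting latent is pure standard Gaussian noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)

# Hypothetical stand-in for a flattened VAE latent of the ground-truth image.
z0 = [0.5, -1.0, 2.0, 0.0]

z_t, eps = noise_latent(z0, t=500, alpha_bar=alpha_bar, rng=rng)
z_T = sample_initial_latent(len(z0), rng)
```

In this sketch the training path keeps the sampled `eps` so that a denoiser's noise prediction could be compared against it, while the inference path has no clean latent at all and relies entirely on iterative denoising.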