Sketch-Guided Scene Image Generation

Tianyu Zhang, Xiaoxuan Xie, Xusheng Du, Haoran Xie

TL;DR

The paper tackles the challenge of generating coherent scene images from hand-drawn sketches using diffusion models. It introduces a two-stage approach that first performs object-level cross-domain generation with ControlNet and identity embeddings to preserve object details, then constructs the scene at the latent level by separately generating the background and blending foregrounds using a global prompt with learned identity tokens. A masked diffusion loss enforces faithful reconstruction of object concepts, while a blended latent diffusion scheme and an alpha-balanced inference enable natural fusion of foreground and background. Quantitative and user studies show the method surpasses state-of-the-art sketch-guided diffusion models in object fidelity and sketch-image consistency. This work advances sketch-driven scene synthesis, offering a practical pathway for layout-consistent, detail-rich image generation from freehand sketches.
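
The masked diffusion loss mentioned above can be illustrated with a minimal NumPy sketch. It assumes the loss is a squared error between the true and predicted noise, restricted to the object mask. The function name, the array shapes, and the normalisation by mask area are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def masked_diffusion_loss(noise, noise_pred, mask):
    """Squared error between true and predicted noise, counted only where mask == 1.

    noise, noise_pred: arrays of shape (C, H, W)
    mask: array of shape (H, W) with values in {0, 1}
    """
    mask = mask[None, :, :].astype(noise.dtype)  # broadcast the mask over channels
    sq_err = (noise * mask - noise_pred * mask) ** 2
    # Assumption: normalise by the number of masked elements, so that the loss
    # does not shrink simply because an object covers few pixels.
    denom = max(mask.sum() * noise.shape[0], 1.0)
    return float(sq_err.sum() / denom)
```

Errors outside the mask contribute nothing, so the learned identity embedding is only pushed to reconstruct the object region rather than whatever surrounds it.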

Abstract

Text-to-image models show an impressive ability to create high-quality and diverse images. Nevertheless, the transition from freehand sketches to complex scene images remains challenging for diffusion models. In this study, we propose a novel sketch-guided scene image generation framework that decomposes scene image generation from sketch inputs into object-level cross-domain generation and scene-level image construction. We employ pre-trained diffusion models to convert each single object drawing into an image of the object, inferring additional details while maintaining the sparse sketch structure. To maintain the conceptual fidelity of the foreground during scene generation, we invert the visual features of the object images into identity embeddings. In scene-level image construction, we generate the latent representation of the scene image using separate background prompts, and then blend in the generated foreground objects according to the layout of the sketch input. To ensure that the details of the foreground objects remain unchanged while the scene image is composed naturally, we infer the scene image from the blended latent representation using a global prompt that includes the trained identity tokens. Through qualitative and quantitative experiments, we demonstrate that the proposed approach surpasses state-of-the-art approaches in generating scene images from hand-drawn sketches.
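
As a rough illustration of blending foreground objects "according to the layout of the sketch input", the sketch below pastes object images onto a blank canvas at given positions and returns a foreground image with its mask. The function name, the `(image, mask, (top, left))` tuple format, and the assumption that every box lies inside the canvas are hypothetical; the paper's own procedure may differ.

```python
import numpy as np

def compose_foreground(canvas_hw, objects):
    """Paste object images onto a blank canvas at their sketch positions.

    canvas_hw: (H, W) of the scene canvas
    objects: list of (image, mask, (top, left)), where image has shape (h, w, 3)
             and mask has shape (h, w) with values in {0, 1}.
    Returns (x_init, m_init): the foreground image and its binary mask.
    """
    H, W = canvas_hw
    x_init = np.zeros((H, W, 3), dtype=np.float32)
    m_init = np.zeros((H, W), dtype=np.float32)
    for image, mask, (top, left) in objects:
        h, w = mask.shape
        region = (slice(top, top + h), slice(left, left + w))
        keep = mask.astype(bool)
        # Slicing yields views, so these assignments write into the canvases.
        x_init[region][keep] = image[keep]
        m_init[region][keep] = 1.0
    return x_init, m_init
```

Later objects in the list overwrite earlier ones where they overlap, so the list order acts as a simple depth order.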

Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: We present a cross-domain generation method from scene sketches to images. The results demonstrate that our method generates semantically complete foregrounds and backgrounds while remaining consistent with the input sketches and their semantics.
  • Figure 2: Existing sketch-guided text-to-image diffusion models perform poorly when generating images from scene sketches. ControlNet zhang2023adding and T2I-Adapter mou2024t2i generate images with missing objects and neglected backgrounds. FineControlNet choi2023finecontrolnet can reproduce the shapes in the sketch, but the results do not always match the intended semantics, such as the "tree" in the image. Additionally, FineControlNet exhibits an obvious separation between foreground and background, so the two do not blend naturally.
  • Figure 3: The proposed sketch-guided scene image generation framework consists of two main components: object-level generation and scene-level construction. (1) Object-level generation: given the input scene sketches, we annotate and separate the individual object sketches and perform cross-domain object generation using ControlNet zhang2023adding. After generation, the images are segmented into masks, and Masked Diffusion Loss avrahami2023break is employed during training to invert the visual features into unique identity embeddings. (2) Scene-level construction: in the trained diffusion model, we construct masks and an initial foreground image that conform to the spatial layout of the sketch and use them to guide the generation of the foreground during the denoising process. We apply this guidance in fewer inference steps, giving the model greater freedom to iteratively refine and resolve scene inconsistencies between foreground and background.
  • Figure 4: The inference process of our method. In the blended inference process, the background prompt $\mathcal{P}_b$ is used to infer the background. The foreground image $x_{init}$ is encoded and noised to represent the foreground objects, and the mask $M_{init}$ is used to blend the latent representations of the foreground and background. In customized inference, we use a global prompt $\mathcal{P}_g$ containing special identity tokens to guide the model in generating images from the blended latent representations.
  • Figure 5: Generated results with different values of $\alpha$. $\alpha = 0$ means no blended inference, and $\alpha = 1$ represents full blending during the inference process. We observe that a balance between layout accuracy and foreground-background consistency is achieved for $\alpha$ between 0.4 and 0.6.
  • ...and 3 more figures
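
The two-phase inference described in the Figure 4 and Figure 5 captions can be sketched as a toy loop. This is a sketch under stated assumptions, not the authors' implementation: $\alpha$ is read here as the fraction of denoising steps that use blending, and `denoise_bg`, `denoise_global`, and `add_noise` are hypothetical callables standing in for the model conditioned on $\mathcal{P}_b$, the model conditioned on $\mathcal{P}_g$, and the forward noising process.

```python
import numpy as np

def blend_latents(z_bg, z_fg, mask):
    """Keep foreground latents inside the mask and background latents outside it."""
    return mask * z_fg + (1.0 - mask) * z_bg

def blended_steps(num_steps, alpha):
    """Number of initial steps that use blending (assumed reading of alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return int(round(alpha * num_steps))

def run_inference(z_T, z_fg_init, mask, num_steps, alpha,
                  denoise_bg, denoise_global, add_noise):
    """Blended inference for the first steps, customized inference afterwards.

    The three callables are hypothetical placeholders, not a real library API.
    """
    z = z_T
    k = blended_steps(num_steps, alpha)
    for i in range(num_steps):
        t = num_steps - i                  # timestep counts down from num_steps to 1
        if i < k:
            z = denoise_bg(z, t)           # background step with prompt P_b
            z_fg_t = add_noise(z_fg_init, t - 1)  # foreground at the matching noise level
            z = blend_latents(z, z_fg_t, mask)
        else:
            z = denoise_global(z, t)       # customized step with global prompt P_g
    return z
```

With `alpha = 0` the loop never blends and reduces to ordinary prompt-guided denoising; with `alpha = 1` the masked foreground is re-imposed at every step, which fixes the layout but leaves no unblended steps in which to reconcile foreground and background.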