CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model

Ruohao Zhan, Yijin Li, Yisheng He, Shuo Chen, Yichen Shen, Xinyu Chen, Zilong Dong, Zhaoyang Huang, Guofeng Zhang

TL;DR

This work addresses the challenge of controllable and progressive sketch generation from text prompts and rough layouts, where end-to-end RGB diffusion struggles to meet artist-facing controls. The authors propose CoProSketch, a diffusion-based pipeline that uses a 2D unsigned distance field (UDF) to represent sketches and a lightweight UDF-to-sketch decoder for final output. A two-stage process generates a rough UDF within a provided bounding box, followed by a refined UDF and final sketch, with optional user edits at the rough stage and an instance-mask branch for layer-based composition. To train and evaluate the approach, they curate the first large-scale text-sketch paired dataset (~100k samples) and show improvements in semantic consistency and controllability over baselines.

Abstract

Sketches serve as fundamental blueprints in artistic creation because sketch editing is easier and more intuitive than pixel-level RGB image editing for painting artists, yet sketch generation remains unexplored despite advancements in generative models. We propose a novel framework, CoProSketch, that provides strong controllability and fine detail for sketch generation with diffusion models. A straightforward method is to fine-tune a pretrained image generation diffusion model on binarized sketch images. However, we find that diffusion models fail to generate clear binary images, which makes the produced sketches chaotic. We thus propose to represent sketches by an unsigned distance field (UDF), which is continuous and can be easily decoded to sketches through a lightweight network. With CoProSketch, users generate a rough sketch from a bounding box and a text prompt. The rough sketch can be manually edited and fed back into the model for iterative refinement, and is then decoded to a detailed sketch as the final result. Additionally, we curate the first large-scale text-sketch paired dataset as the training data. Experiments demonstrate superior semantic consistency and controllability over baselines, offering a practical solution for integrating user feedback into generative workflows.
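
To make the UDF idea concrete, the following is a minimal sketch and not the authors' implementation. It computes, for every pixel of a binarized sketch, the Euclidean distance to the nearest stroke pixel, which yields a continuous field instead of a hard 0/1 image. The function names, the exponential squashing `squash` (a hypothetical stand-in for the transform $f(u)$, whose form is not given on this page), and the threshold decoder are all assumptions; the paper decodes the UDF with a lightweight network rather than a threshold. The brute-force search is only meant for tiny grids.

```python
import math

def unsigned_distance_field(sketch):
    """Brute-force UDF of a binary sketch (2D list, 1 = stroke pixel).

    Returns a 2D list holding, per pixel, the Euclidean distance to the
    nearest stroke pixel. O(pixels * strokes), fine for small examples.
    """
    strokes = [(r, c) for r, row in enumerate(sketch)
               for c, v in enumerate(row) if v]
    if not strokes:
        raise ValueError("sketch contains no stroke pixels")
    return [[min(math.hypot(r - sr, c - sc) for sr, sc in strokes)
             for c in range(len(row))]
            for r, row in enumerate(sketch)]

def squash(u, tau=4.0):
    """Hypothetical stand-in for f(u): maps a distance to (0, 1], 1 on strokes."""
    return math.exp(-u / tau)

def threshold_decode(udf, threshold=0.5):
    """Naive decoder (assumption): a pixel is a stroke if its distance is small."""
    return [[1 if u <= threshold else 0 for u in row] for row in udf]

# A single stroke pixel in the middle of a 3x3 canvas.
sketch = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]
udf = unsigned_distance_field(sketch)
recovered = threshold_decode(udf)
```

Here `udf` is 0 on the stroke and grows smoothly away from it, and thresholding recovers the original binary sketch in this toy case.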

Paper Structure

This paper contains 18 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Demonstrations of the proposed pipeline. Left: The proposed pipeline takes a text prompt and an expected layout, represented by a bounding box, as input and generates sketches progressively, from rough to detailed. If the results are unsatisfactory, the user can make timely edits during the rough stage at a low cost. Right: one application is layer-based composition, where the layers (i.e., instance masks) and the sketches are both the output from the proposed pipeline.
  • Figure 2: (a): The proposed pipeline begins by taking a text prompt and a rough mask (derived from a bounding box) as input to generate a rough UDF representation. If users find the results unsatisfactory, they have the option to edit the rough result. The rough result, which is the sketch decoded from the UDF, is re-encoded back into the UDF after editing. The edited result is then converted into an instance mask, which is fed back into the same model, guided by a different stage indicator, to produce the refined output. (b): Details of our modified U-Net: The conditional mask is concatenated with the noisy latent. The stage indicator is first converted into an embedding and then added to the time embedding. All other components remain unchanged.
  • Figure 3: Details of UDF representation. Given a binarized sketch, we compute its UDF representation and transform it by $f(u)$ to adapt it for training networks.
  • Figure 4: Dataset construction process.
  • Figure 5: Dataset sample. (a): Rough sketches. (b): Detailed sketches.
  • ...and 4 more figures
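
The U-Net conditioning described in the Figure 2(b) caption can be illustrated with a small, dependency-free sketch. It is not the authors' code: the sinusoidal time embedding, the embedding size, the stage ids `ROUGH` and `REFINE`, the values in `STAGE_TABLE` (which would be learned in a real model), and the assumption that the mask is already at latent resolution are all hypothetical. Only the two operations themselves come from the caption: the conditional mask is concatenated with the noisy latent, and the stage embedding is added to the time embedding.

```python
import math

def time_embedding(t, dim):
    """Sinusoidal timestep embedding (assumed form; not stated in the source)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

# Hypothetical stage ids and embedding table; real rows would be learned.
ROUGH, REFINE = 0, 1
STAGE_TABLE = [
    [0.1, 0.0, -0.1, 0.0],   # rough stage
    [-0.2, 0.3, 0.0, 0.1],   # refinement stage
]

def conditioning_vector(t, stage):
    """Add the stage embedding to the time embedding, as in Figure 2(b)."""
    stage_emb = STAGE_TABLE[stage]
    t_emb = time_embedding(t, len(stage_emb))
    return [a + b for a, b in zip(t_emb, stage_emb)]

def unet_input(noisy_latent, cond_mask):
    """Concatenate the conditional mask with the noisy latent along channels.

    noisy_latent: list of C channels, each an H x W list of lists.
    cond_mask: a single H x W channel (assumed to be at latent resolution).
    """
    h, w = len(cond_mask), len(cond_mask[0])
    for channel in noisy_latent:
        if len(channel) != h or any(len(row) != w for row in channel):
            raise ValueError("mask and latent must share spatial size")
    return noisy_latent + [cond_mask]

latent = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]  # 4 x 2 x 2
mask = [[1, 0], [0, 1]]                                # 1 x 2 x 2
x = unet_input(latent, mask)                           # 5 x 2 x 2
```

Because both stages share one network, the stage indicator is the only signal telling the model whether to produce a rough or a refined UDF; adding its embedding to the time embedding leaves the rest of the architecture unchanged, as the caption notes.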