Block and Detail: Scaffolding Sketch-to-Image Generation

Vishnu Sarukkai, Lu Yuan, Mia Tang, Maneesh Agrawala, Kayvon Fatahalian

TL;DR

This paper introduces a sketch-to-image tool in which users draw blocking strokes to coarsely represent the placement and form of objects and detail strokes to refine their shapes and silhouettes. It also develops a two-pass algorithm that generates high-fidelity images from such sketches.

Abstract

We introduce a novel sketch-to-image tool that aligns with the iterative refinement process of artists. Our tool lets users sketch blocking strokes to coarsely represent the placement and form of objects and detail strokes to refine their shape and silhouettes. We develop a two-pass algorithm for generating high-fidelity images from such sketches at any point in the iterative process. In the first pass we use a ControlNet to generate an image that strictly follows all the strokes (blocking and detail), and in the second pass we add variation by renoising regions surrounding blocking strokes. We also present a dataset generation scheme that, when used to train a ControlNet architecture, allows regions that do not contain strokes to be interpreted as not-yet-specified regions rather than empty space. We show that this partial-sketch-aware ControlNet can generate coherent elements from partial sketches that contain only a small number of strokes. The high-fidelity images produced by our approach serve as scaffolds that can help the user adjust the shape and proportions of objects or add additional elements to the composition. We demonstrate the effectiveness of our approach with a variety of examples and evaluative comparisons. Quantitatively, evaluative user feedback indicates that novice viewers preferred the quality of images from our algorithm over a baseline Scribble ControlNet for 84% of the pairs and found that our images had less distortion in 81% of the pairs.
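
The second pass described above renoises regions around blocking strokes while keeping regions near detail strokes. A minimal sketch of how such a renoising mask might be built is given below, using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the detail-stroke radius `sigma_d`, the rule "near blocking strokes but not near detail strokes", and the per-pixel `blend` step are hypothetical simplifications (the actual method blends during a diffusion pass, which is not modeled here).

```python
from typing import List, Sequence

Grid = List[List[int]]  # binary image: 1 = stroke / selected pixel, 0 = otherwise


def dilate(mask: Grid, radius: int) -> Grid:
    """Binary dilation of `mask` with a disk of the given pixel radius."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    # Precompute the offsets that fall inside the disk.
    offsets = [
        (dy, dx)
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
        if dy * dy + dx * dx <= radius * radius
    ]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out


def renoise_mask(blocking: Grid, detail: Grid, sigma_b: int, sigma_d: int) -> Grid:
    """Return 1 where the second pass may vary the image, 0 where pass one is kept.

    Assumption: a pixel is renoised if it lies within `sigma_b` of a blocking
    stroke and farther than `sigma_d` from every detail stroke.
    """
    near_blocking = dilate(blocking, sigma_b)
    near_detail = dilate(detail, sigma_d)
    return [
        [int(b == 1 and d == 0) for b, d in zip(row_b, row_d)]
        for row_b, row_d in zip(near_blocking, near_detail)
    ]


def blend(first_pass: Sequence[Sequence], renoised: Sequence[Sequence], mask: Grid) -> list:
    """Take renoised pixels inside the mask and first-pass pixels elsewhere."""
    return [
        [r if m else f for f, r, m in zip(row_f, row_r, row_m)]
        for row_f, row_r, row_m in zip(first_pass, renoised, mask)
    ]
```

In this sketch, increasing `sigma_b` widens the band around blocking strokes in which the second pass is free to vary the image, which mirrors the role the paper assigns to the blocking dilation radius.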

Paper Structure (10 sections, 18 figures, 1 algorithm)


Figures (18)

  • Figure 1: In early stages of sketching, artists often specify object forms via rough blocking strokes. Standard ControlNet adheres too strictly to these strokes, creating images with object forms that are unrealistic or poorly proportioned (misshapen cat, overly circular flower, poorly proportioned scooter, simplified cupcake silhouette). Instead, our algorithm treats blocking strokes as rough guidelines for object form, enabling artists to generate visual inspiration that is both realistic and faithful to their intent.
  • Figure 2: Our algorithm takes as input a text prompt ("a baseball photorealistic") and a sketch consisting of blocking strokes (green) and detail strokes (black). In a first pass it feeds all strokes to a partial-sketch-aware ControlNet to produce an image, denoted as $I_{tc}$, that tightly adheres to all strokes. Here the contour of the baseball in $I_{tc}$ is misshapen because the input blocking strokes (green) are not quite circular. To generate variation in areas surrounding the blocking strokes, our algorithm applies a second diffusion pass we call Blended Renoising. Based on a renoising mask formed by dilating the input strokes, blended renoising generates variation in the area surrounding blocking strokes while preserving areas near detail strokes. The renoised output image corrects the baseball's contour, while the location of the stitching closely follows the user's detail strokes. See Algorithm 1 for pseudocode.
  • Figure 3: Dilation radius is more interpretable as a parameter for loosening spatial control than ControlNet strength. As the dilation radius of blocking strokes $\sigma_b$ increases, our algorithm maintains the rough shape of the vase while gradually allowing greater variation in the exact shape of its contour. In contrast, varying the ControlNet guidance strength without dilating the strokes leads to sudden, unpredictable behavior as the strength is decreased. Adherence to the strokes is lost when hands appear in the third image, and the vase completely loses its form by the fourth image.
  • Figure 4: Synthetic data generation: ordering strokes by their distance from the boundary of the foreground object's mask (gray tint) enables the preferential deletion of strokes furthest from the object boundary. For the car, the first 20% of the lines capture a portion of the silhouette. This enables us to train a partial-sketch ControlNet that auto-completes object forms and also attempts to fill in details.
  • Figure 5: Our Partial-Sketch-Aware ControlNet vs Scribble ControlNet (Zhang et al., 2023). Our Partial-Sketch-Aware ControlNet generates additional elements beyond the drawn strokes: generating multiple shoes instead of a single shoe for 'shoes', generating a second mug for 'two mugs', and generating a house with interesting foreground and background details from a simple outline for 'house'. The standard Scribble ControlNet fails to generate additional elements and produces minimal details in both the foreground and background of the house.
  • ...and 13 more figures
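
The stroke ordering described in the Figure 4 caption can be illustrated with a short sketch: strokes are ranked by their distance from the boundary of the foreground mask, and the farthest ones are dropped first to produce a partial sketch. This is a hedged approximation, not the authors' data pipeline. The stroke representation (lists of pixel coordinates), the use of the mean point-to-boundary distance, the deterministic cutoff, and the names `boundary_pixels`, `stroke_distance`, and `partial_sketch` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[int, int]   # (row, col) pixel coordinate
Stroke = List[Point]      # a stroke is a list of pixels it covers


def boundary_pixels(mask: Sequence[Sequence[int]]) -> List[Point]:
    """Foreground pixels with at least one 4-neighbor that is background or off-image."""
    h, w = len(mask), len(mask[0])
    boundary = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    boundary.append((y, x))
                    break
    return boundary


def stroke_distance(stroke: Stroke, boundary: List[Point]) -> float:
    """Mean distance from the stroke's pixels to the nearest boundary pixel (assumed metric)."""
    return sum(min(math.dist(p, b) for b in boundary) for p in stroke) / len(stroke)


def partial_sketch(strokes: List[Stroke], mask: Sequence[Sequence[int]],
                   keep_fraction: float) -> List[Stroke]:
    """Keep the fraction of strokes closest to the object boundary, nearest first."""
    boundary = boundary_pixels(mask)
    if not boundary:
        raise ValueError("foreground mask is empty, so it has no boundary")
    ordered = sorted(strokes, key=lambda s: stroke_distance(s, boundary))
    n_keep = max(1, round(keep_fraction * len(ordered)))
    return ordered[:n_keep]
```

Under these assumptions, a small `keep_fraction` retains mostly silhouette strokes, consistent with the caption's observation that the first 20% of the car's lines capture part of its silhouette. A training pipeline could sample `keep_fraction` at random per example to produce sketches at different stages of completion.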