CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing

Chufeng Xiao, Hongbo Fu

TL;DR

The paper addresses a limitation of text-only personalization for text-to-image models, namely weak control over shape, pose, and local details, by introducing the task of sketch concept extraction. It proposes CustomSketching, a two-stage framework that learns a new textual token $[v]$ and dual sketch encoders for contour ($S_C$) and detail ($S_D$) lines to enable fine-grained, sketch-guided editing within a pre-trained diffusion model. Key contributions include the novel task, a dual-sketch representation with a masked encoder, and a loss design combining $\mathcal{L}_{rec}$, $\mathcal{L}_{shape}$, and $\mathcal{L}_{reg}$. The method is validated on a new dataset, in a user study, and in several applications (local editing, concept transfer, multi-concept generation, style variation). It improves identity preservation and reconstruction quality over adapted baselines, offers better editability and controllability, and supports plug-and-play multi-concept generation. Limitations include the low-resolution latent space and the per-concept training time, which suggests future work on higher resolution, faster personalization, and broader applicability.

Abstract

Personalization techniques for large text-to-image (T2I) models allow users to incorporate new concepts from reference images. However, existing methods primarily rely on textual descriptions, leading to limited control over customized images and failing to support fine-grained and local editing (e.g., shape, pose, and details). In this paper, we identify sketches as an intuitive and versatile representation that can facilitate such control, e.g., contour lines capturing shape information and flow lines representing texture. This motivates us to explore a novel task of sketch concept extraction: given one or more sketch-image pairs, we aim to extract a special sketch concept that bridges the correspondence between the images and sketches, thus enabling sketch-based image synthesis and editing at a fine-grained level. To accomplish this, we introduce CustomSketching, a two-stage framework for extracting novel sketch concepts. Considering that an object can often be depicted by a contour for general shapes and additional strokes for internal details, we introduce a dual-sketch representation to reduce the inherent ambiguity in sketch depiction. We employ a shape loss and a regularization loss to balance fidelity and editability during optimization. Through extensive experiments, a user study, and several applications, we show our method is effective and superior to the adapted baselines.
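The summary above names three loss terms, $\mathcal{L}_{rec}$, $\mathcal{L}_{shape}$, and $\mathcal{L}_{reg}$, but does not state how they are combined. A common way to balance such terms is a weighted sum; the form below is a sketch under that assumption, and the weights $\lambda_{shape}$ and $\lambda_{reg}$ are hypothetical symbols introduced here for illustration, not values taken from the paper.

```latex
\mathcal{L} = \mathcal{L}_{rec} + \lambda_{shape}\,\mathcal{L}_{shape} + \lambda_{reg}\,\mathcal{L}_{reg}
```

Under this reading, a larger $\lambda_{shape}$ would favor fidelity to the drawn shape, while a larger $\lambda_{reg}$ would favor editability, which matches the stated goal of balancing fidelity and editability during optimization.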

Paper Structure (12 sections, 8 equations, 21 figures, 2 tables)

Figures (21)

  • Figure 1: Given one or several sketch-image pairs as training data, our CustomSketching can learn a novel sketch concept into a text token $[v]$ and specific sketches. We decompose a sketch into shape lines (blue strokes) and detail lines (red strokes) to reduce the ambiguity in a sketch. Users may input a text prompt and a dual-sketch to re-create or edit the concept at a fine-grained level.
  • Figure 2: Given a text prompt (a, bottom) and a sketch (b) depicting specific semantics (e.g., clothing folds and hair), T2I-adapter (c) and ControlNet (d) could not correctly interpret the out-of-domain sketch types, while our method can extract such a novel sketch concept and reconstruct the reference image (a, top). Note that the reference image is not used by (c) and (d), and their results are for reference only.
  • Figure 3: The pipeline of our CustomSketching, which extracts novel sketch concepts for fine-grained image synthesis and editing via a two-stage framework. During training, given one or a few sketch-image pairs, Stage I only optimizes a textual embedding of a newly added token $[v]$ to represent the global semantics of the reference image(s), while Stage II jointly fine-tunes the token and two sketch encoders to reconstruct the concept in terms of local appearance and geometry. We adopt a dual-sketch representation to differentiate shape lines $S_C$ and detail lines $S_D$. During inference, users may provide a text prompt and a dual-sketch to manipulate the learned concept.
  • Figure 4: Comparisons of the results generated by our method and three adapted baselines, given the same text prompt and sketch. In the sketch column, the top one is the annotated sketch corresponding to the original image for training while the bottom one is an edited sketch.
  • Figure 5: Box plots of the ratings in the perceptual user study. Each value above the median line is the average rating for each method. The higher, the better.
  • ...and 16 more figures
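
The two-stage schedule described in the Figure 3 caption can be sketched as a small table of which parameter groups are optimized in each stage. This is a minimal illustration, not the authors' implementation: the group names are hypothetical labels, and treating the pre-trained diffusion model as frozen in both stages is an assumption inferred from the caption, which mentions only the token and the two sketch encoders as being optimized.

```python
from typing import Dict, Set

# Hypothetical labels for parameter groups; the real module names are not given in this summary.
TOKEN = "token_embedding_[v]"          # textual embedding of the new token [v]
CONTOUR_ENCODER = "sketch_encoder_S_C" # encoder for shape (contour) lines S_C
DETAIL_ENCODER = "sketch_encoder_S_D"  # encoder for detail lines S_D
BACKBONE = "pretrained_diffusion"      # assumed frozen in both stages

ALL_GROUPS = (TOKEN, CONTOUR_ENCODER, DETAIL_ENCODER, BACKBONE)


def trainable_groups(stage: int) -> Set[str]:
    """Return the parameter groups optimized in the given training stage."""
    if stage == 1:
        # Stage I: only the textual embedding of [v] is optimized.
        return {TOKEN}
    if stage == 2:
        # Stage II: the token and both sketch encoders are fine-tuned jointly.
        return {TOKEN, CONTOUR_ENCODER, DETAIL_ENCODER}
    raise ValueError(f"unknown stage: {stage}")


def requires_grad_flags(stage: int) -> Dict[str, bool]:
    """Map every parameter group to whether it receives gradients in this stage."""
    active = trainable_groups(stage)
    return {name: name in active for name in ALL_GROUPS}


if __name__ == "__main__":
    for stage in (1, 2):
        print(stage, requires_grad_flags(stage))
```

In a real training loop, flags like these would typically decide which parameters are handed to the optimizer at the start of each stage.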