KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, Rajiv Ramnath

TL;DR

KnobGen tackles the challenge of sketch-conditioned diffusion by unifying coarse and fine control through a dual-pathway architecture. It introduces a Coarse-Grained Controller (CGC) for high-level semantics and a Fine-Grained Controller (FGC) for detailed refinement, plus a training-time Modulator and an inference-time Inference Knob to balance and tailor output fidelity to user sketches. The approach leverages CLIP-based multimodal conditioning and cross-attention to fuse text and sketch semantics, while allowing plug-and-play FGCs (e.g., ControlNet, T2I-Adapter) for flexibility. Empirical results on MultiGen-20M and a newly collected sketch dataset show improved sketch alignment, image realism, and user-perceived quality over baselines such as ControlNet and T2I-Adapter, demonstrating practical applicability across novice and professional users.

Abstract

Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user's specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.
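
One way to picture the knob described above is as a single scalar that scales the fine-grained features before they are added to the coarse ones. The sketch below is illustrative only: the abstract does not specify how the knob is implemented, so the function name `blend_controls`, the plain-list feature representation, and the linear weighting are all assumptions rather than the authors' method.

```python
from typing import List


def blend_controls(coarse: List[float], fine: List[float], knob: float) -> List[float]:
    """Hypothetical blend of coarse and fine control features.

    knob = 0.0 keeps only the coarse (high-level) signal;
    knob = 1.0 adds the fine (detail) signal at full strength.
    The linear rule is an assumption made for illustration.
    """
    if not 0.0 <= knob <= 1.0:
        raise ValueError("knob must lie in [0, 1]")
    if len(coarse) != len(fine):
        raise ValueError("feature vectors must have the same length")
    # Coarse semantics are always kept; fine detail is scaled by the knob.
    return [c + knob * f for c, f in zip(coarse, fine)]


# A novice sketch might call for a low knob, a professional one for a high knob.
coarse_features = [1.0, 2.0]
fine_features = [0.5, -0.5]
novice = blend_controls(coarse_features, fine_features, knob=0.0)
expert = blend_controls(coarse_features, fine_features, knob=1.0)
```

Under this assumed rule, a low knob value lets the coarse pathway dominate, so flaws in a rough sketch are less likely to be reproduced, while a high value passes the sketch's detail through more faithfully.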

Paper Structure

This paper contains 32 sections, 6 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: KnobGen. Our method democratizes sketch-based image generation by effectively handling a broad spectrum of sketch complexity and user drawing ability—from novice sketches to those made by seasoned artists—while maintaining the natural appearance of the image.
  • Figure 2: Qualitative results demonstrating the impact of varying the weight in the T2I-Adapter model. Lower weights produce images that align poorly with the spatial layout of the input sketch, while higher weights improve the spatial conformity of the generated image to the sketch. However, higher weights compromise the natural appearance of the generated images.
  • Figure 3: Comparison of sketch-control approaches in diffusion models (DM). (a) Fine-grained control methods such as ControlNet or T2I-Adapter rigidly reproduce a novice sketch, resulting in an unrealistic image. (b) Abstraction-aware frameworks such as koley2024s fail to capture fine-grained details without text guidance. (c) Our proposed KnobGen smooths out the imperfections of the user's drawing while preserving the features of the novice sketch. FGC: Fine-grained Controller, CGC: Coarse-grained Controller, E$_{T}$: Text Encoder, E$_{I}$: Image Encoder, DM: Diffusion Model.
  • Figure 4: KnobGen vs. baselines on novice sketches. KnobGen handles novice sketches by injecting features from the Micro and Macro Pathways in a controlled manner. The dual-pathway design ensures that the generated image is faithful to the spatial layout of the original input sketch and has a natural appearance. Baseline methods, however, have difficulty maintaining these properties in their generations. We also provide examples with a null prompt as an ablation study to demonstrate the robustness of KnobGen.
  • Figure 5: Overview of KnobGen during training and inference. (A) illustrates the training process, where the CGC and FGC modules are dynamically balanced by the modulator. (B) expands on the CGC module, detailing how high-level semantics from both text and image inputs are integrated. (C) shows the inference process, including the knob mechanism that allows user-driven control over the level of fine-grained detail in the final image.
  • ...and 6 more figures