KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models
Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, Rajiv Ramnath
TL;DR
KnobGen addresses sketch-conditioned diffusion by unifying coarse and fine control in a dual-pathway architecture. It introduces a Coarse-Grained Controller (CGC) for high-level semantics and a Fine-Grained Controller (FGC) for detailed refinement, plus a Modulator used during training and an Inference Knob used at inference, which together balance the two pathways and let users set how closely the output follows their sketch. The approach uses CLIP-based multimodal conditioning and cross-attention to fuse text and sketch semantics, and accepts plug-and-play FGCs (e.g., ControlNet, T2I-Adapter). Empirical results on MultiGen-20M and a newly collected sketch dataset show improved sketch alignment, image realism, and user-perceived quality over baselines such as ControlNet and T2I-Adapter, indicating that the method is practical for both novice and professional users.
Abstract
Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but these models often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user's specific needs. Together, these mechanisms let KnobGen flexibly generate images from both novice sketches and those drawn by seasoned artists, maintaining control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.
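
To make the idea of adjusting the relative strength of the two controllers concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that each controller yields a flat feature vector and that the knob is a scalar in [0, 1] scaling the fine-grained contribution; the function name blend_control_signals and this weighting scheme are hypothetical and do not come from the paper.

```python
from typing import List


def blend_control_signals(coarse: List[float], fine: List[float], knob: float) -> List[float]:
    """Combine hypothetical coarse- and fine-grained guidance features.

    Assumption (not from the paper): the coarse signal is always applied,
    and the fine signal is scaled by a scalar knob in [0, 1].
    """
    if not 0.0 <= knob <= 1.0:
        raise ValueError("knob must lie in [0, 1]")
    if len(coarse) != len(fine):
        raise ValueError("coarse and fine features must have the same length")
    # knob = 0 keeps only coarse semantics; knob = 1 adds the full fine detail.
    return [c + knob * f for c, f in zip(coarse, fine)]


if __name__ == "__main__":
    coarse_features = [1.0, 2.0]   # stand-in for CGC output
    fine_features = [0.5, -1.0]    # stand-in for FGC output
    for knob in (0.0, 0.5, 1.0):
        print(knob, blend_control_signals(coarse_features, fine_features, knob))
```

Under this assumption, a low knob value would suit a rough novice sketch, where only the overall semantics should be trusted, while a value near 1 would suit a precise professional sketch whose details should be followed closely.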
