Multi-Style Facial Sketch Synthesis through Masked Generative Modeling
Bowen Sun, Guo Lu, Shibao Zheng
TL;DR
The paper tackles facial sketch synthesis under data scarcity and limited style variability by presenting a lightweight, end-to-end framework based on masked generative modeling of VQ-GAN latent tokens, guided by a CLIP-based feature encoder and a style-conditioned transformer. It employs a two-stage training regime: pre-training the transformer with masked image modeling and then fine-tuning a decoder with pixel and perceptual losses, enabling high-quality multi-style sketch generation from a single photograph without auxiliary inputs. A continuous style parameter enables interpolation between styles beyond the training set, enhancing versatility for data augmentation and cross-modal recognition. Empirical results on CelebA and FS2K show state-of-the-art performance across standard metrics, with clear improvements in background-foreground separation and stylistic diversity. Overall, the approach provides a data-efficient, controllable pathway to multi-style FSS with practical implications for security, entertainment, and design.
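The continuous style parameter mentioned above can be illustrated with a minimal sketch. It assumes that each training style has a learned embedding vector and that intermediate styles are obtained by linear blending; the summary does not state the exact interpolation rule, so the function name `interpolate_style`, the 4-dimensional vectors, and the linear form are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List


def interpolate_style(style_a: List[float], style_b: List[float], alpha: float) -> List[float]:
    """Linearly blend two style embeddings.

    alpha = 0 returns style_a, alpha = 1 returns style_b, and values in
    between give an intermediate style (assumed linear blending).
    """
    if len(style_a) != len(style_b):
        raise ValueError("style embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(style_a, style_b)]


# Hypothetical embeddings for two styles seen during training.
STYLE_A = [1.0, 0.0, 0.5, -1.0]
STYLE_B = [0.0, 1.0, 0.5, 1.0]

# An intermediate style that is not itself in the training set.
halfway = interpolate_style(STYLE_A, STYLE_B, 0.5)
```

In a full model, the blended vector would condition the transformer in place of a discrete style label.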
Abstract
Facial sketch synthesis (FSS) models, which generate sketch portraits from facial photographs, have important applications in cross-modal face recognition, entertainment, art, and media, among other domains. However, producing high-quality sketches remains difficult, primarily because of three factors: (1) the scarcity of artist-drawn data, (2) the limited number of available style types, and (3) the inadequate processing of input information in existing models. To address these difficulties, we propose a lightweight end-to-end synthesis model that efficiently converts images to corresponding multi-stylized sketches without requiring any supplementary inputs (e.g., 3D geometry). We overcome the issue of data insufficiency by incorporating semi-supervised learning into the training process. Additionally, we employ a feature extraction module and style embeddings to guide the generative transformer during the iterative prediction of masked image tokens, thus achieving continuous stylized output that accurately retains facial features in the sketches. Extensive experiments demonstrate that our method consistently outperforms previous algorithms across multiple benchmarks by a clear margin.
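The iterative prediction of masked image tokens can be sketched as follows. This is a toy illustration, not the authors' code: it assumes a confidence-based unmasking loop with a cosine schedule, as is common in masked generative transformers, and replaces the style-conditioned transformer with a random stand-in (`toy_predictor`). The `MASK` sentinel, the codebook size of 16, and all function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple

MASK = -1  # sentinel id for a masked latent token (assumption)

Predictor = Callable[[List[int], float, random.Random], List[Tuple[int, float]]]


def toy_predictor(tokens: List[int], style: float, rng: random.Random) -> List[Tuple[int, float]]:
    """Stand-in for the transformer: returns a (token_id, confidence) pair
    for every position. It ignores its inputs and samples at random."""
    codebook_size = 16
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]


def iterative_decode(num_tokens: int, steps: int, style: float,
                     predictor: Predictor, seed: int = 0) -> List[int]:
    """Start from a fully masked sequence and commit the most confident
    predictions at each step until no masked token remains."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        predictions = predictor(tokens, style, rng)
        # Cosine schedule: how many tokens should stay masked after this step.
        if step == steps:
            keep_masked = 0
        else:
            keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * step / steps))
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the masked positions with the highest predicted confidence.
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = predictions[i][0]
    return tokens


# 64 latent tokens (e.g. an 8 x 8 grid), filled in over 8 refinement steps.
sketch_tokens = iterative_decode(num_tokens=64, steps=8, style=0.5, predictor=toy_predictor)
```

In a real pipeline the predictor would take the photo features and the style embedding as conditioning, and the completed token grid would be passed to the VQ-GAN decoder to render the sketch.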
