Deformable One-shot Face Stylization via DINO Semantic Guidance
Yang Zhou, Zichong Chen, Hui Huang
TL;DR
This work tackles one-shot deformable face stylization by using a paired real-style reference and DINO-based semantic guidance to learn cross-domain structural deformation. A deformation-aware StyleGAN2 with TPS-STN modules is fine-tuned using two novel cross-domain losses (directional deformation and relative structural consistency), alongside an adversarial style transfer component and color alignment via style-mixing. The framework achieves expressive geometric exaggerations while preserving identity, trains efficiently (~10 minutes), and reports qualitative, quantitative, and user-study results that surpass state-of-the-art one-shot methods. The approach shows practical potential for flexible, high-fidelity stylization in real-world applications where paired references are available.
Abstract
This paper addresses one-shot face stylization with simultaneous consideration of appearance and structure, where previous methods have fallen short. We explore deformation-aware face stylization that departs from the traditional single-image style reference and uses a real-style image pair instead. The cornerstone of our method is a self-supervised vision transformer, specifically DINO-ViT, which establishes a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformer networks (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align the color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization and is efficient, with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate that our method outperforms state-of-the-art one-shot face stylization methods. Code is available at https://github.com/zichongc/DoesFS
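The two DINO-guided constraints named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the exact loss formulations are not given here, so the sketch assumes (i) a directional loss of the form one minus the cosine similarity between a per-sample feature direction and the reference direction, and (ii) a consistency loss that compares cosine self-similarity matrices of tokens. The function names and the random arrays standing in for DINO-ViT token features are hypothetical.

```python
import numpy as np


def _unit(v, eps=1e-8):
    """L2-normalize along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)


def directional_deformation_loss(f_src, f_tgt, f_real_ref, f_style_ref):
    """Assumed form: 1 - cos(sample direction, reference direction).

    Each argument is a (num_tokens, dim) array of features, a stand-in
    for DINO-ViT tokens of: the source-generator output, the adapted-
    generator output, the real reference, and the style reference.
    """
    d_sample = _unit((f_tgt - f_src).ravel())
    d_ref = _unit((f_style_ref - f_real_ref).ravel())
    return 1.0 - float(np.dot(d_sample, d_ref))


def self_similarity(tokens):
    """Cosine self-similarity matrix of tokens, shape (N, N)."""
    t = _unit(tokens)
    return t @ t.T


def structural_consistency_loss(f_src, f_tgt):
    """Assumed form: mean squared difference of self-similarity matrices."""
    diff = self_similarity(f_src) - self_similarity(f_tgt)
    return float(np.mean(diff ** 2))


# Toy usage with random features in place of real DINO-ViT outputs.
rng = np.random.default_rng(0)
num_tokens, dim = 16, 32
f_real_ref = rng.normal(size=(num_tokens, dim))
f_style_ref = rng.normal(size=(num_tokens, dim))
f_src = rng.normal(size=(num_tokens, dim))
f_tgt = f_src + 0.5 * (f_style_ref - f_real_ref)  # moves along the reference direction

print("directional loss:", directional_deformation_loss(f_src, f_tgt, f_real_ref, f_style_ref))
print("consistency loss:", structural_consistency_loss(f_src, f_tgt))
```

In this sketch the directional term is near zero when a generated sample moves through feature space in the same direction as the real-to-style reference pair, while the self-similarity term depends only on relations among tokens within one image, so it is unaffected by a uniform rescaling of the features.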
