Rethink Arbitrary Style Transfer with Transformer and Contrastive Learning
Zhanjie Zhang, Jiakai Sun, Guangyuan Li, Lei Zhao, Quanwei Zhang, Zehua Lan, Haolin Yin, Wei Xing, Huaizhong Lin, Zhiwen Zuo
TL;DR
This work addresses quality and artifact issues in arbitrary style transfer. It introduces three components: Style Consistency Instance Normalization (SCIN), which aligns content features with global style using a transformer-based style extractor; Instance-based Contrastive Learning (ICL), which models stylization-to-stylization relations via CLIP-based embeddings; and a Perception Encoder (PE), which captures style information beyond fixed classification features. The model is trained with a composite objective that combines a style loss $L_s$, a content loss $L_c$, an identity loss, an adversarial loss, and the proposed contrastive loss: $L = \lambda_{1} L_s + \lambda_{2} L_c + \lambda_{3} L_{identity} + \lambda_{4} L_{Adv} + \lambda_{5} L_{contra}$. Experiments on MS-COCO and WikiArt show improved stylization quality, fewer artifacts, and better preservation of content and local textures compared with state-of-the-art methods.
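The composite objective above is a plain weighted sum of five scalar loss terms. The following is a minimal sketch of that sum only, assuming each term has already been computed elsewhere. The names `LossWeights` and `total_loss` are hypothetical, and the default weights of 1.0 are placeholders, since the values of $\lambda_{1}$ through $\lambda_{5}$ are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights lambda_1..lambda_5 of the composite objective.

    The defaults are placeholders (assumption); the actual values
    are not stated in this summary.
    """
    style: float = 1.0        # lambda_1, multiplies L_s
    content: float = 1.0      # lambda_2, multiplies L_c
    identity: float = 1.0     # lambda_3, multiplies L_identity
    adversarial: float = 1.0  # lambda_4, multiplies L_Adv
    contrastive: float = 1.0  # lambda_5, multiplies L_contra


def total_loss(l_s, l_c, l_identity, l_adv, l_contra, w=LossWeights()):
    """Return L = l1*L_s + l2*L_c + l3*L_identity + l4*L_Adv + l5*L_contra."""
    return (
        w.style * l_s
        + w.content * l_c
        + w.identity * l_identity
        + w.adversarial * l_adv
        + w.contrastive * l_contra
    )
```

Setting a weight to zero removes the corresponding term, which is one simple way to run an ablation over the individual losses.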
Abstract
Arbitrary style transfer has attracted widespread research attention and has numerous practical applications. Existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, fail to generate high-quality stylized images. In this paper, we introduce an innovative technique to improve the quality of stylized images. First, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. Second, we develop an Instance-based Contrastive Learning (ICL) approach that learns the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Third, recognizing that VGG networks are better at extracting classification features than at capturing style features, we introduce the Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that our proposed method generates high-quality stylized images and effectively prevents artifacts compared with existing state-of-the-art methods.
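For context, the adaptive normalization that the abstract attributes to existing methods can be sketched as follows. This is the generic adaptive instance normalization idea (re-scaling content features to match the style's per-channel mean and standard deviation), not SCIN itself, whose details are not given in this section. The function names and the single-channel list representation are assumptions made for illustration.

```python
import math


def channel_stats(values, eps=1e-5):
    """Return (mean, std) of one feature channel given as a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps keeps the standard deviation strictly positive for flat channels.
    return mean, math.sqrt(var + eps)


def adaptive_instance_norm(content, style, eps=1e-5):
    """Shift and scale one content channel to the style channel's statistics.

    Generic baseline (assumption): normalize the content channel to zero mean
    and unit variance, then apply the style channel's std and mean.
    """
    c_mean, c_std = channel_stats(content, eps)
    s_mean, s_std = channel_stats(style, eps)
    return [s_std * (v - c_mean) / c_std + s_mean for v in content]


# Usage: the output keeps the content's spatial pattern (low, then high)
# but takes on the style channel's mean and spread.
stylized = adaptive_instance_norm([0.0, 2.0], [10.0, 14.0])
```

Because this baseline matches only per-channel mean and variance, it conveys little about the spatial arrangement of the style, which is the kind of limitation the abstract points to in adaptive normalization methods.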
