Rethink Arbitrary Style Transfer with Transformer and Contrastive Learning

Zhanjie Zhang, Jiakai Sun, Guangyuan Li, Lei Zhao, Quanwei Zhang, Zehua Lan, Haolin Yin, Wei Xing, Huaizhong Lin, Zhiwen Zuo

TL;DR

This work addresses quality and artifact issues in arbitrary style transfer. It introduces three components: Style Consistency Instance Normalization (SCIN), which aligns content features with global style using a transformer-based style extractor; Instance-based Contrastive Learning (ICL), which models stylization-to-stylization relations via CLIP-based embeddings; and a Perception Encoder (PE), which captures style information beyond fixed classification features. The model is trained with a composite objective that combines perceptual style and content losses, an identity loss, an adversarial loss, and the proposed contrastive loss: $L = \lambda_{1} L_s + \lambda_{2} L_c + \lambda_{3} L_{identity} + \lambda_{4} L_{Adv} + \lambda_{5} L_{contra}$. Experiments on MS-COCO and WikiArt show improved stylization quality, fewer artifacts, and better preservation of content and local textures than state-of-the-art methods.
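The composite objective in the TL;DR is a weighted sum of five scalar loss terms. A minimal sketch of that combination follows. The `LossWeights` class, the `total_loss` function, and the default weight values of 1.0 are hypothetical, since the summary does not state the $\lambda$ values, and reading $L_s$ as the style loss and $L_c$ as the content loss is an assumption based on the subscripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Placeholder values (assumption): the summary does not give the lambdas.
    style: float = 1.0        # lambda_1, weights L_s
    content: float = 1.0      # lambda_2, weights L_c
    identity: float = 1.0     # lambda_3, weights L_identity
    adversarial: float = 1.0  # lambda_4, weights L_Adv
    contrastive: float = 1.0  # lambda_5, weights L_contra


def total_loss(l_s, l_c, l_identity, l_adv, l_contra, w=LossWeights()):
    """Return lambda_1*L_s + lambda_2*L_c + lambda_3*L_identity
    + lambda_4*L_Adv + lambda_5*L_contra for already-computed loss values."""
    return (
        w.style * l_s
        + w.content * l_c
        + w.identity * l_identity
        + w.adversarial * l_adv
        + w.contrastive * l_contra
    )
```

In a training loop, each argument would be the scalar produced by the corresponding loss module; setting a weight to zero removes that term, which is a convenient way to run ablations.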

Abstract

Arbitrary style transfer has attracted wide research attention and has numerous practical applications. Existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, fail to generate high-quality stylized images. In this paper, we introduce a technique to improve the quality of stylized images. First, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. In addition, we develop an Instance-based Contrastive Learning (ICL) approach designed to learn the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Recognizing that VGG networks are better at extracting classification features than at capturing style features, we also introduce the Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that our method generates high-quality stylized images and effectively prevents artifacts compared with existing state-of-the-art methods.
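For context, the adaptive-normalization family that the abstract contrasts against can be illustrated with the well-known AdaIN operation, which rescales each content channel to the mean and standard deviation of the matching style channel. The sketch below is that baseline, not the paper's SCIN. The function names, the `EPS` value, and the representation of a feature map as a list of flattened channels are assumptions made for illustration.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # small constant to avoid division by zero (assumed value)


def adain_channel(content, style):
    """Shift and scale one content channel to the style channel's mean and std."""
    mu_c, sigma_c = fmean(content), pstdev(content)
    mu_s, sigma_s = fmean(style), pstdev(style)
    return [sigma_s * (x - mu_c) / (sigma_c + EPS) + mu_s for x in content]


def adain(content_feats, style_feats):
    """Apply AdaIN per channel; each feature map is a list of flattened channels."""
    return [adain_channel(c, s) for c, s in zip(content_feats, style_feats)]
```

Because the statistics are computed per channel over the whole style image, this operation transfers only global first- and second-order statistics, which is consistent with the abstract's remark that such methods struggle with detailed texture.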

Paper Structure

This paper contains 18 sections, 18 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Stylized examples from existing arbitrary style transfer methods. Although attention-based methods can learn local textures and content-style correlations, they sometimes carry content features of the style image into the result (Row 1). Non-attention-based methods fail to learn detailed textures and also generate artifacts.
  • Figure 2: Overview of the proposed method, which consists of pre-trained VGG and CLIP encoders, Style Consistency Instance Normalization (SCIN), and a discriminator.
  • Figure 3: The structure of the proposed SCIN, which mainly consists of multi-head self-attention (MSA) modules and a feed-forward network (FFN).
  • Figure 4: Details of the Perception Encoder (PE).
  • Figure 5: Qualitative comparison with other state-of-the-art arbitrary style transfer methods.
  • ...and 3 more figures