Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network

Sizhe Zheng, Pan Gao, Peng Zhou, Jie Qin

TL;DR

Puff-Net addresses the challenge of achieving high-quality style transfer with preserved content while maintaining computational efficiency. It uses an encoder-only transformer to fuse pure content and pure style features, which are produced by two specialized feature extractors (an invertible neural network-based content extractor and a lite-transformer-based style extractor). The approach combines perceptual content and style losses with reconstruction/identity losses for the extractors, and leverages content-aware positional encoding to enhance alignment between content and style. Empirical results show Puff-Net achieves competitive stylization quality with lower model capacity and faster inference, enabling more practical on-device or real-time applications while preserving global structure.

Abstract

Style transfer aims to render an image with the artistic features of a style image, while maintaining the original structure. Various methods have been put forward for this task, but some challenges still exist. For instance, it is difficult for CNN-based methods to handle global information and long-range dependencies between input images, for which transformer-based methods have been proposed. Although transformers can better model the relationship between content and style images, they require high-cost hardware and time-consuming inference. To address these issues, we design a novel transformer model that includes only the encoder, thus significantly reducing the computational cost. In addition, we find that existing style transfer methods may produce images that are under-stylized or missing content. In order to achieve better stylization, we design a content feature extractor and a style feature extractor, based on which pure content and style images can be fed to the transformer. Finally, we propose a novel network termed Puff-Net, i.e., pure content and style feature fusion network. Through qualitative and quantitative experiments, we demonstrate the advantages of our model compared to state-of-the-art ones in the literature.

Paper Structure (22 sections, 7 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Comparison of different models based on loss and capacity, with the loss being a combination of $60\%$ content loss and $40\%$ style loss. Our model shows a favorable balance between capacity and loss. Details can be found in the Method and Experiments sections.
  • Figure 2: Some results of our Puff-Net. Our method achieves a better balance between maintaining stylized effects and reducing computational costs. The main body and background of the content image can be stylized more reasonably based on the style image.
  • Figure 3: Schematic illustration of the Puff-Net architecture. The network begins by extracting content and style features from the input images. These features are then divided into patches and encoded into patch sequences through a linear projection. After feeding the features into the transformer for stylization, we can finally obtain the result image through the decoder. Additionally, the model leverages a reconstruction loss function during training to enhance its ability to reconstruct content and style features.
  • Figure 4: The visual results of qualitative comparisons.
  • Figure 5: The features extracted by our model. The first and second rows display renderings of the content extractor, while the third and fourth rows display renderings of the style extractor.
  • ...and 6 more figures
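
Two of the captions above describe simple computations that may be easier to follow in code: the 60%/40% weighting of content and style loss used for the Figure 1 comparison, and the step in Figure 3 where features are divided into patches and turned into a patch sequence by a linear projection. The following is a minimal sketch, not the authors' implementation. The function names, the single-channel feature map, the patch size, and the hand-written projection weights are all assumptions made for illustration; the actual model operates on multi-channel features with learned weights, and the captions do not say how the two losses are scaled before they are combined.

```python
from typing import List

Grid = List[List[float]]  # H x W feature map (single channel, an assumption)


def combined_loss(content_loss: float, style_loss: float,
                  content_weight: float = 0.6) -> float:
    """Weighted sum from the Figure 1 caption: 60% content, 40% style."""
    return content_weight * content_loss + (1.0 - content_weight) * style_loss


def to_patch_sequence(feature_map: Grid, patch: int) -> List[List[float]]:
    """Cut the map into non-overlapping patch x patch tiles (row-major order)
    and flatten each tile into one token."""
    height, width = len(feature_map), len(feature_map[0])
    if height % patch or width % patch:
        raise ValueError("feature map size must be divisible by the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [feature_map[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            sequence.append(tile)
    return sequence


def linear_project(sequence: List[List[float]], weights: Grid,
                   bias: List[float]) -> List[List[float]]:
    """Compute token @ weights + bias for every token.
    weights has shape (patch * patch) x embedding_dim."""
    columns = list(zip(*weights))  # one tuple per output dimension
    return [[sum(x * w for x, w in zip(token, column)) + b
             for column, b in zip(columns, bias)]
            for token in sequence]


if __name__ == "__main__":
    # Hypothetical 4 x 4 feature map holding the values 0..15.
    features = [[float(4 * row + col) for col in range(4)] for row in range(4)]
    tokens = to_patch_sequence(features, patch=2)
    # Hypothetical fixed weights; a trained model would learn these.
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    embedded = linear_project(tokens, weights, bias=[0.0, 0.5])
    print(len(tokens), embedded[0])
    print(combined_loss(content_loss=2.0, style_loss=1.0))
```

In this sketch a 4 x 4 map with patch size 2 yields four tokens of length four, and the projection maps each token to a two-dimensional embedding. The resulting sequence is the kind of input the caption says is fed to the transformer.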