DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer

Ying Hu, Chenyi Zhuang, Pan Gao

TL;DR

This work proposes a training-free approach to style transfer that combines textual embeddings with spatial features and separates the injection of content and style. The BLIP-2 encoder is used to extract the textual representation of the style image.

Abstract

Style transfer aims to fuse the artistic representation of a style image with the structural information of a content image. Existing methods train specific networks or utilize pre-trained models to learn content and style features. However, they rely solely on textual or spatial representations that are inadequate to achieve the balance between content and style. In this work, we propose a novel and training-free approach for style transfer, combining textual embedding with spatial features and separating the injection of content or style. Specifically, we adopt the BLIP-2 encoder to extract the textual representation of the style image. We utilize the DDIM inversion technique to extract intermediate embeddings in content and style branches as spatial features. Finally, we harness the step-by-step property of diffusion models by separating the injection of content and style in the target branch, which improves the balance between content preservation and style fusion. Various experiments have demonstrated the effectiveness and robustness of our proposed DiffuseST for achieving balanced and controllable style transfer results, as well as the potential to extend to other tasks.
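
The abstract relies on DDIM inversion to walk an image back toward noise while recording intermediate states. The following is a minimal sketch of the standard deterministic DDIM update, not the authors' implementation. The names `ddim_step`, `ddim_invert`, `eps_model`, and `abars` are hypothetical, samples are plain Python lists, and the sketch stores only the latent trajectory. The spatial features described in the paper are taken from inside the denoising network, which is not modeled here.

```python
import math
from typing import Callable, List

Vector = List[float]


def ddim_step(x: Vector, eps: Vector, abar_from: float, abar_to: float) -> Vector:
    """Deterministic DDIM update from noise level abar_from to abar_to.

    abar_* are cumulative alpha products in (0, 1]; a smaller value means more noise.
    """
    # Estimate the clean sample implied by the current sample and predicted noise.
    x0_pred = [
        (xi - math.sqrt(1.0 - abar_from) * ei) / math.sqrt(abar_from)
        for xi, ei in zip(x, eps)
    ]
    # Re-noise that estimate to the target noise level using the same noise.
    return [
        math.sqrt(abar_to) * x0 + math.sqrt(1.0 - abar_to) * ei
        for x0, ei in zip(x0_pred, eps)
    ]


def ddim_invert(
    x0: Vector,
    eps_model: Callable[[Vector, int], Vector],  # hypothetical noise predictor
    abars: List[float],  # assumed decreasing: abars[0] is the least noisy level
) -> List[Vector]:
    """Run DDIM inversion and return every intermediate sample."""
    x = list(x0)
    trajectory = [x]
    for t in range(len(abars) - 1):
        eps = eps_model(x, t)
        x = ddim_step(x, eps, abars[t], abars[t + 1])
        trajectory.append(x)
    return trajectory
```

Running `ddim_step` in the opposite direction with the same noise prediction undoes one inversion step, which is why the recorded states can guide a later denoising pass.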

Paper Structure

This paper contains 34 sections, 12 equations, 21 figures, 4 tables, 1 algorithm.

Figures (21)

  • Figure 1: Style transfer (column 3) and extended image editing (column 4) results by our method.
  • Figure 2: Overall framework of DiffuseST. The target branch is to perform style transfer guided by textual and spatial representations of two images. We adopt the BLIP-2 encoder to produce text-aligned features of the style image. We utilize the DDIM inversion technique and extract inner spatial features in the content and style branches, respectively. The content and style spatial injections are separated at different steps in the target branch to achieve balanced stylization.
  • Figure 3: We compare our method with advanced style transfer methods. The content and style images are given in the first and second columns. Our method produces harmonious and high-quality stylized images, while others are less aesthetic due to the lack of balance between content and style representation.
  • Figure 4: Ablation study on $\alpha$ to control the injection proportion of content and style. Larger $\alpha$ determines more steps in the target branch for style injection.
  • Figure 5: Ablation study of textual and spatial representations. Together with textual condition, content injection (CI), and style injection (SI), DiffuseST (the last column) achieves harmonious and balanced stylization.
  • ...and 16 more figures
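
The captions of Figures 2 and 4 describe content and style injection happening at different steps of the target branch, with a larger $\alpha$ assigning more steps to style injection. The following is a minimal sketch of one way such a split could be computed. It assumes that $\alpha$ is a fraction in [0, 1] and that content injection occupies the earlier steps and style injection the later ones. Neither assumption is stated in the captions, and the function name `split_injection_steps` is hypothetical.

```python
from typing import List, Tuple


def split_injection_steps(num_steps: int, alpha: float) -> Tuple[List[int], List[int]]:
    """Partition denoising step indices into content and style injection phases.

    Assumption: step 0 is the first (noisiest) denoising step, content injection
    comes first, and the last round(alpha * num_steps) steps inject style.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    num_style = round(alpha * num_steps)
    num_content = num_steps - num_style
    steps = list(range(num_steps))
    return steps[:num_content], steps[num_content:]


content_steps, style_steps = split_injection_steps(num_steps=10, alpha=0.3)
```

Under these assumptions, `alpha=0` gives every step to content injection and `alpha=1` gives every step to style injection, matching the caption's statement that a larger $\alpha$ means more style steps.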