D2Styler: Advancing Arbitrary Style Transfer with Discrete Diffusion Methods

Onkar Susladkar, Gayatri Deshmukh, Sparsh Mittal, Parth Shastri

TL;DR

This work tackles arbitrary style transfer (AST) by addressing mode collapse and content distortion through a new discrete-diffusion framework, $D^{2}$Styler, that operates in the discrete latent space of a VQ-GAN encoder. It introduces TransDiffuser, a transformer-based diffusion prior conditioned on AdaIN features derived from content and style statistics, and it uses a two-stage pipeline where Stage 1 encodes to discrete latents and Stage 2 decodes with perceptual losses including $L_{style}$, $L_{cont}$, and a novel $L_{feat}$ aligning AdaIN with the generated image. The approach achieves state-of-the-art results on WikiArt-COCO benchmarks, surpassing twelve baselines on metrics like GM, SSIM, LPIPS, and PD, with 78M parameters and reduced inference time, while enabling multi-style transfer via linear AdaIN feature combinations. Ablation studies validate the importance of each component, including CNN-based encoders, the four loss terms, and diffusion-step choices, demonstrating robust performance with fewer diffusion steps than traditional diffusion models. Overall, $D^{2}$Styler offers high-quality, controllable AST with practical runtime, enabling applications in complex digital art and editing tasks and paving the way for broader diffusion-based image manipulation in discrete latent spaces.
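The AdaIN conditioning and the linear multi-style combination mentioned above can be illustrated with a minimal NumPy sketch. It uses the standard AdaIN formulation (re-normalizing content features to the style's per-channel mean and standard deviation); the function names `channel_stats`, `adain`, and `blend_styles`, the `(C, H, W)` layout, and the weight normalization are assumptions made for illustration and are not taken from the authors' code.

```python
import numpy as np

def channel_stats(feat, eps=1e-5):
    """Per-channel mean and std over the spatial dims of a (C, H, W) feature map."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(feat.var(axis=(1, 2), keepdims=True) + eps)
    return mean, std

def adain(content, style, eps=1e-5):
    """Standard AdaIN: whiten content per channel, then apply the style statistics."""
    c_mean, c_std = channel_stats(content, eps)
    s_mean, s_std = channel_stats(style, eps)
    return s_std * (content - c_mean) / c_std + s_mean

def blend_styles(content, styles, weights):
    """Hypothetical helper: linear combination of AdaIN features from several styles."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # assumption: weights are normalized to sum to 1
    feats = [adain(content, s, eps=1e-5) for s in styles]
    return sum(w * f for w, f in zip(weights, feats))

# Illustrative usage with random stand-ins for encoder feature maps.
rng = np.random.default_rng(0)
content_feat = rng.normal(size=(4, 8, 8))
style_a = 3.0 * rng.normal(size=(4, 8, 8)) + 2.0
style_b = 0.5 * rng.normal(size=(4, 8, 8)) - 1.0
mixed = blend_styles(content_feat, [style_a, style_b], weights=[0.7, 0.3])
```

In this sketch the blended tensor would play the role of the conditioning signal; how the paper feeds it into the diffusion prior is not specified in this summary.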

Abstract

In image processing, one of the most challenging tasks is to render an image's semantic meaning using a variety of artistic approaches. Existing techniques for arbitrary style transfer (AST) frequently suffer from mode collapse, over-stylization, or under-stylization due to a disparity between the style and content images. We propose a novel framework called D$^2$Styler (Discrete Diffusion Styler) that leverages the discrete representational capability of VQ-GANs and the advantages of discrete diffusion, including stable training and avoidance of mode collapse. Our method uses Adaptive Instance Normalization (AdaIN) features as a context guide for the reverse diffusion process. This guidance makes it easy to transfer features from the style image to the content image without bias. The proposed method substantially enhances the visual quality of style-transferred images, combining content and style in a visually appealing manner. We take style images from the WikiArt dataset and content images from the COCO dataset. Experimental results demonstrate that D$^2$Styler produces high-quality style-transferred images and outperforms twelve existing methods on nearly all metrics. The qualitative results and ablation studies provide further insights into the efficacy of our technique. The code is available at https://github.com/Onkarsus13/D2Styler.
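To make the "discrete representational capability of VQ-GANs" concrete, the following is a minimal sketch of the generic vector-quantization step: each continuous latent vector is replaced by the index of its nearest codebook entry, and those indices are the discrete tokens a discrete diffusion model operates on. The function name `quantize`, the toy codebook, and all shapes are assumptions for illustration; the paper's actual encoder and codebook size are not given in this summary.

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour lookup: latents (N, D), codebook (K, D) -> indices (N,), vectors (N, D)."""
    # Squared Euclidean distance via ||z||^2 - 2 z.e + ||e||^2, giving shape (N, K).
    dists = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy example: a 3-entry codebook in 3 dimensions (hypothetical values).
codebook = np.eye(3)
latents = np.array([[0.9, 0.1, 0.0],
                    [0.0, 0.2, 0.8]])
indices, quantized = quantize(latents, codebook)
```

Here the first latent is closest to codebook entry 0 and the second to entry 2, so the image would be represented by the token sequence `[0, 2]`.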

Paper Structure (12 sections, 11 figures, 5 tables)

This paper contains 12 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Results from D$^2$Styler, our proposed method. The content image (the one on the top of the tree) is converted to different stylized versions based on the corresponding style image (shown in the inset).
  • Figure 2: The architecture of the proposed method. The content and style images are encoded using a pretrained VQ-GAN encoder. The encoded input is passed through the diffusion prior conditioned on the AdaIN (Huang and Belongie, 2017) features. The VQ-GAN decoder is then used to obtain the resultant image. The dotted line indicates that the diffusion prior is trained separately from the decoder.
  • Figure 3: (a) The proposed TransDiffuser architecture consists of transformer blocks stacked on each other. The attention query is obtained from the AdaIN block (see the AdaIN section). (b) Transformer blocks follow the traditional architecture (Vaswani et al., 2017) except for the querying of the AdaIN features.
  • Figure 4: Qualitative results. The numbers below images show GM ($\uparrow$) and SSIM ($\uparrow$).
  • Figure 5: Qualitative results of D$^2$Styler on the COCO dataset.
  • ...and 6 more figures
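
The Figure 3 caption states that the attention query in TransDiffuser is obtained from the AdaIN block. A rough single-head sketch of that idea, using standard scaled dot-product attention, is given here. Everything beyond "query comes from AdaIN features" is an assumption: the function name `adain_query_attention`, the projection matrices, the token counts, and the choice to take keys and values from the latent tokens are hypothetical and may differ from the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adain_query_attention(adain_tokens, latent_tokens, w_q, w_k, w_v):
    """Scaled dot-product attention with queries from AdaIN features (assumed layout).

    adain_tokens:  (M, D_a) AdaIN-derived conditioning tokens
    latent_tokens: (N, D_z) embeddings of the discrete latent tokens
    """
    q = adain_tokens @ w_q   # (M, d) queries from the AdaIN features
    k = latent_tokens @ w_k  # (N, d) keys from the latent tokens (assumption)
    v = latent_tokens @ w_v  # (N, d) values from the latent tokens (assumption)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ v, weights

# Illustrative shapes only.
rng = np.random.default_rng(0)
adain_tokens = rng.normal(size=(5, 16))
latent_tokens = rng.normal(size=(7, 32))
w_q = rng.normal(size=(16, 8))
w_k = rng.normal(size=(32, 8))
w_v = rng.normal(size=(32, 8))
attended, attn_weights = adain_query_attention(adain_tokens, latent_tokens, w_q, w_k, w_v)
```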