
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang Chen, Qian He, Yongdong Zhang

TL;DR

This paper introduces DEADiff, a stylization diffusion model built on a mechanism that decouples the style and semantics of reference images. It attains the best visual stylization results and the best balance between the text controllability inherent in the text-to-image model and style similarity to the reference image, as demonstrated both quantitatively and qualitatively.

Abstract

Diffusion-based text-to-image models hold great potential for transferring the style of a reference image. However, current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles. In this paper, we introduce DEADiff to address this issue using the following two strategies: 1) A mechanism to decouple the style and semantics of reference images. The decoupled feature representations are first extracted by Q-Formers, which are instructed by different text descriptions. They are then injected into mutually exclusive subsets of cross-attention layers for better disentanglement. 2) A non-reconstructive learning method. The Q-Formers are trained using paired images rather than an identical target: the reference image and the ground-truth image share the same style or the same semantics. We show that DEADiff attains the best visual stylization results and the best balance between the text controllability inherent in the text-to-image model and style similarity to the reference image, as demonstrated both quantitatively and qualitatively. Our project page is https://tianhao-qi.github.io/DEADiff/.
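
The two strategies in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the layer names, the choice of which layers receive the style feature, the record fields (`id`, `style`, `subject`), and the helpers `route_conditions` and `make_pairs` are all hypothetical. The first function assigns each cross-attention layer exactly one of the two Q-Former outputs, so the style and content subsets are mutually exclusive. The second builds non-reconstructive training pairs, where the reference and the target are different images that share a style (or a subject).

```python
from typing import Dict, List, Sequence, Set, Tuple


def route_conditions(
    layer_names: Sequence[str],
    style_layers: Set[str],
) -> Dict[str, str]:
    """Assign every cross-attention layer exactly one condition.

    Layers listed in `style_layers` receive the "style" feature and all
    other layers receive the "content" feature, so the two subsets are
    disjoint by construction. Which layers belong to which subset is an
    assumption made here for illustration.
    """
    return {
        name: ("style" if name in style_layers else "content")
        for name in layer_names
    }


def make_pairs(
    records: List[Dict[str, str]],
    condition: str,
) -> List[Tuple[str, str]]:
    """Build (reference_id, target_id) pairs for non-reconstructive training.

    For condition "style" the two images share a style label; for
    "content" they share a subject label. A record is never paired with
    itself, so the target is never identical to the reference.
    """
    key = "style" if condition == "style" else "subject"
    pairs = []
    for ref in records:
        for tgt in records:
            if ref["id"] != tgt["id"] and ref[key] == tgt[key]:
                pairs.append((ref["id"], tgt["id"]))
    return pairs


# Hypothetical layer names and a toy dataset.
layers = ["down_0", "down_1", "mid", "up_0", "up_1"]
routing = route_conditions(layers, style_layers={"up_0", "up_1"})

records = [
    {"id": "a", "style": "ink", "subject": "cat"},
    {"id": "b", "style": "ink", "subject": "dog"},
    {"id": "c", "style": "oil", "subject": "cat"},
]
style_pairs = make_pairs(records, "style")
content_pairs = make_pairs(records, "content")
```

In this toy dataset, images `a` and `b` share a style but not a subject, so they form the style pairs, while `a` and `c` share a subject but not a style, so they form the content pairs.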

Paper Structure (23 sections, 2 equations, 14 figures, 5 tables)

This paper contains 23 sections, 2 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Given a style reference image, DEADiff synthesizes new images that both resemble the reference style and remain faithful to the text prompt. In contrast, previous encoder-based methods (e.g., T2I-Adapter [mou2023t2i]) significantly impair the text controllability of diffusion-based text-to-image models.
  • Figure 2: The training and inference paradigm of DEADiff. We use proprietary paired datasets to train the Q-Former to extract disentangled representations under the conditions "style" and "content"; these representations are injected into mutually exclusive cross-attention layers.
  • Figure 3: Illustration of the proposed joint text-image cross-attention layer.
  • Figure 4: Qualitative comparison with the state-of-the-art methods. Zoom in for better visualization.
  • Figure 5: Visual comparison between StyleDrop and DEADiff.
  • ...and 9 more figures
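
The joint text-image cross-attention layer named in Figure 3 is not specified in this summary. One plausible reading, sketched below in plain Python, is that the keys and values computed from text tokens and from image tokens are concatenated along the token axis, and a single softmax attention is applied over the combined set. The function name `joint_cross_attention`, the toy dimensions, and the omission of the learned projection matrices are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def joint_cross_attention(
    queries: List[Vector],
    text_keys: List[Vector],
    text_values: List[Vector],
    image_keys: List[Vector],
    image_values: List[Vector],
) -> List[Vector]:
    """Attend over text and image tokens with one shared softmax.

    Inputs are assumed to be already projected; the learned projection
    matrices of a real layer are left out to keep the sketch short.
    """
    keys = text_keys + image_keys        # concatenate along the token axis
    values = text_values + image_values
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# One query, one text token and one image token with identical keys,
# so both tokens receive equal attention weight.
out = joint_cross_attention(
    queries=[[1.0, 0.0]],
    text_keys=[[1.0, 0.0]],
    text_values=[[2.0, 0.0]],
    image_keys=[[1.0, 0.0]],
    image_values=[[0.0, 4.0]],
)
```

Because a single softmax spans both token sets, text and image tokens compete for the same attention mass. This differs from running two separate attention operations and summing their outputs.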