V-Shuffle: Zero-Shot Style Transfer via Value Shuffle

Haojun Tang, Qiwei Lin, Tongda Xu, Lida Huang, Yan Wang

TL;DR

V-Shuffle tackles zero-shot style transfer by using multiple style images from the same domain and disrupting semantic content through value-feature shuffling in diffusion self-attention. It introduces a Hybrid Style Regularization scheme that combines mid-diffusion low-level style cues with high-level textures to balance content preservation and style fidelity. Empirical results on AST and Sim2Real show strong performance, with multi-image inputs yielding better trade-offs and single-image transfers outperforming prior state-of-the-art methods. The method requires no additional fine-tuning, offering a practical and scalable approach to high-quality style transfer in diffusion-based frameworks.

Abstract

Attention injection-based style transfer has achieved remarkable progress in recent years. However, existing methods often suffer from content leakage, where the undesired semantic content of the style image mistakenly appears in the stylized output. In this paper, we propose V-Shuffle, a zero-shot style transfer method that leverages multiple style images from the same style domain to effectively navigate the trade-off between content preservation and style fidelity. V-Shuffle implicitly disrupts the semantic content of the style images by shuffling the value features within the self-attention layers of the diffusion model, thereby preserving low-level style representations. We further introduce a Hybrid Style Regularization that complements these low-level representations with high-level style textures to enhance style fidelity. Empirical results demonstrate that V-Shuffle achieves excellent performance when utilizing multiple style images. Moreover, when applied to a single style image, V-Shuffle outperforms previous state-of-the-art methods.
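The value-shuffle idea described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head, plain Python lists instead of diffusion features, and one random permutation applied across the tokens of all style images concatenated together (whether the paper permutes within each image or across images is not stated in the abstract). The function names `shuffle_values` and `injected_attention` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def shuffle_values(values, rng):
    """Return a copy of `values` with its rows (tokens) randomly permuted.

    Only the sequence (token) axis is permuted; each feature vector stays
    intact. Assumption: one permutation over all concatenated style tokens.
    """
    perm = list(range(len(values)))
    rng.shuffle(perm)
    return [values[i] for i in perm]


def injected_attention(q_content, k_style, v_style):
    """Attention with content queries and style keys/values.

    q_content: content tokens, each a list of d floats
    k_style, v_style: style tokens (all style images concatenated)
    """
    d = len(q_content[0])
    out = []
    for q in q_content:
        # Scaled dot-product scores between one content query and every style key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_style]
        weights = softmax(scores)
        # Weighted average of the (possibly shuffled) style values.
        out.append([sum(w * v[j] for w, v in zip(weights, v_style))
                    for j in range(len(v_style[0]))])
    return out


rng = random.Random(0)
d, n_content, n_style_tokens = 4, 3, 6  # toy sizes, chosen arbitrarily
q_c = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_content)]
k_s = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_style_tokens)]
v_s = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_style_tokens)]

v_shuffled = shuffle_values(v_s, rng)  # keys keep their order, values do not
stylized_tokens = injected_attention(q_c, k_s, v_shuffled)
```

Because the keys keep their original order while the values are permuted, each attention weight no longer points at the value from the same spatial location. The output still draws on the same set of style feature vectors, but their spatial arrangement, and with it the semantic layout of the style image, is broken.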

Paper Structure

This paper contains 19 sections, 6 equations, 16 figures, 2 tables, 3 algorithms.

Figures (16)

  • Figure 1: Image style transfer results by the proposed V-Shuffle. (a) Comparison between baselines and our V-Shuffle on single image style transfer. (b) Results of our V-Shuffle with a few style images. Best viewed in zoomed-in mode.
  • Figure 2: An example of LoRA-based style transfer. Both K-LoRA and Zip-LoRA tend to preserve only high-level subject semantics while failing to maintain strict structural correspondence with the content image.
  • Figure 3: PCA of $V_{s_{1:n}}^t$ features and visualization of stylized output. Columns 3-4: content leakage; columns 5-6: low-level style representation; column 7: better results. The top row corresponds to $n=3$, while the bottom row corresponds to $n=1$. Best viewed in zoomed-in mode.
  • Figure 4: Overview of V-Shuffle. We first extract $Q_c^t$ for $I_c$ from the self-attention block of $\epsilon_{\theta}$, as well as $K_{s_{1:n}}^t$ and $V_{s_{1:n}}^t$ for $I_{s_{1:n}}$. To mitigate content leakage, we shuffle $V_{s_{1:n}}^t$ to obtain $V_{s_{1:n}}^{t\#}$. We then apply Hybrid Style Regularization to navigate the trade-off between style fidelity and content preservation, optimizing $z_T^{cs}$ for $T$ iterations using $\mathcal{L}_{HSR}$. Finally, we generate the stylized image $I_{cs} = \mathcal{D}(z_0^{cs})$.
  • Figure 5: A toy experiment illustrates that only shuffling along the sequence dimension $s$ alleviates content leakage, though at the expense of partially degrading style fidelity. Here $n=1$. Best viewed in zoomed-in mode.
  • ...and 11 more figures