V-Shuffle: Zero-Shot Style Transfer via Value Shuffle
Haojun Tang, Qiwei Lin, Tongda Xu, Lida Huang, Yan Wang
TL;DR
V-Shuffle tackles zero-shot style transfer by drawing on multiple style images from the same domain and disrupting their semantic content through value-feature shuffling in the diffusion model's self-attention layers. Its Hybrid Style Regularization scheme combines mid-diffusion low-level style cues with high-level textures to balance content preservation and style fidelity. Experiments on AST and Sim2Real show strong performance: multi-image inputs yield better trade-offs, and single-image transfer outperforms prior state-of-the-art methods. Because the method requires no additional fine-tuning, it offers a practical, scalable route to high-quality style transfer in diffusion-based frameworks.
Abstract
Attention-injection-based style transfer has achieved remarkable progress in recent years. However, existing methods often suffer from content leakage, where undesired semantic content from the style image mistakenly appears in the stylized output. In this paper, we propose V-Shuffle, a zero-shot style transfer method that leverages multiple style images from the same style domain to effectively navigate the trade-off between content preservation and style fidelity. V-Shuffle implicitly disrupts the semantic content of the style images by shuffling the value features within the self-attention layers of the diffusion model, thereby preserving low-level style representations. We further introduce a Hybrid Style Regularization that complements these low-level representations with high-level style textures to enhance style fidelity. Empirical results demonstrate that V-Shuffle achieves excellent performance when utilizing multiple style images. Moreover, when applied to a single style image, V-Shuffle outperforms previous state-of-the-art methods.
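To make the value-shuffling idea concrete, the following is a minimal NumPy sketch of one plausible reading of the abstract, not the authors' implementation. It assumes that content queries attend to keys taken from several style images, and that the corresponding value tokens are pooled across those images and randomly permuted before aggregation. The function name `shuffled_value_attention`, the tensor shapes, and the choice to shuffle along the pooled token axis are all illustrative assumptions; the abstract does not specify where or how the shuffle is applied.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shuffled_value_attention(q_content, k_style, v_style, rng):
    """Attention with shuffled style values (hypothetical sketch).

    q_content: (n_queries, d) queries from the content branch.
    k_style, v_style: (n_images, n_tokens, d) keys/values from style images.
    Assumption: values are pooled over all style images and permuted,
    which breaks the key-to-value spatial correspondence.
    """
    n_images, n_tokens, d = v_style.shape
    k = k_style.reshape(n_images * n_tokens, d)
    v = v_style.reshape(n_images * n_tokens, d)

    # Random permutation over every value token from every style image.
    perm = rng.permutation(n_images * n_tokens)
    v_shuffled = v[perm]

    # Standard scaled dot-product attention weights from content to style.
    attn = softmax(q_content @ k.T / np.sqrt(d), axis=-1)

    # Each output token is a convex combination of shuffled style values.
    return attn @ v_shuffled, perm


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))      # 5 content tokens, feature dim 8
k = rng.normal(size=(3, 4, 8))   # 3 style images, 4 tokens each
v = rng.normal(size=(3, 4, 8))
out, perm = shuffled_value_attention(q, k, v, rng)
print(out.shape)
```

Under these assumptions, each output token remains a weighted average of style value features, so channel-wise style statistics are still available, while the pairing between a key's location and its value is broken. This is one way to read the abstract's claim that shuffling disrupts semantic content while preserving low-level style representations.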
