Visual Style Prompting with Swapping Self-Attention

Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, Youngjung Uh

TL;DR

The paper addresses controlled visual style transfer in text-to-image diffusion models without retraining. It introduces visual style prompting: in the late self-attention layers, the keys and values are swapped with those from a reference image, so the content dictated by the text prompt is preserved while the reference style is adopted. Across the paper's evaluations, the method achieves strong style fidelity with minimal content leakage and stays aligned with the prompts, outperforming several training-based and prompting baselines. The approach is compatible with existing conditioning techniques and extends to real images through inversion, offering practical, training-free style control. The paper also discusses ethical considerations.

Abstract

In the evolving domain of text-to-image generation, diffusion models have emerged as powerful tools in content creation. Despite their remarkable capability, existing models still face challenges in achieving controlled generation with a consistent style, requiring costly fine-tuning or often inadequately transferring the visual elements due to content leakage. To address these challenges, we propose a novel approach, visual style prompting, to produce a diverse range of images while maintaining specific style elements and nuances. During the denoising process, we keep the query from the original features while swapping the key and value with those from the reference features in the late self-attention layers. This approach allows visual style prompting without any fine-tuning, ensuring that generated images maintain a faithful style. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, best reflecting the style of the references and ensuring that the resulting images match the text prompts most accurately. Our project page is available at https://curryjung.github.io/VisualStylePrompt/.
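
The swap described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`attention`, `self_attention_step`) and the `is_late_layer` flag are hypothetical, the query, key, and value inputs are assumed to be already projected, and multi-head splitting is omitted. The sketch only shows the idea that the query stays with the original features while the key and value come from the reference features in late layers.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention over lists of token vectors.
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def self_attention_step(q_orig, k_orig, v_orig, k_ref, v_ref, is_late_layer):
    # Hypothetical helper: in late layers, keep the original query but
    # attend over the reference key and value; otherwise attend as usual.
    if is_late_layer:
        return attention(q_orig, k_ref, v_ref)
    return attention(q_orig, k_orig, v_orig)

# Toy features: two original tokens, one reference token.
q_orig = [[1.0, 0.0], [0.0, 1.0]]
k_orig = [[1.0, 1.0]]
v_orig = [[7.0, 8.0]]
k_ref = [[0.5, 0.5]]
v_ref = [[3.0, 4.0]]

late = self_attention_step(q_orig, k_orig, v_orig, k_ref, v_ref, True)
early = self_attention_step(q_orig, k_orig, v_orig, k_ref, v_ref, False)
```

In this toy case the output keeps one row per original query token, so the spatial layout follows the original features, while the values that are mixed into each row come from the reference when the swap is active.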

Paper Structure

This paper contains 26 sections, 2 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: We tackle visual style prompting, reflecting style elements from reference images and contents from text prompts, in a training-free manner.
  • Figure 2: Ambiguity of text prompts. (a) Ambiguity of text leads to different results within the same style description. (b) Even a detailed style description does not guarantee the generation of the same style images since it has many variants that can hardly be constrained using only text prompts. (c) Reference images can specify detailed visual elements.
  • Figure 3: Overview of swapping self-attention for visual style prompting. We swap the key and value features of the self-attention block in an original denoising process with the ones from a reference denoising process. This procedure is repeated for T steps, resulting in the original content rendered with the style elements from the reference image.
  • Figure 4: The effect of swapping self-attention across different layers. Swapping self-attention on the bottleneck and downblocks causes content leakage, producing cat-like results despite the dog prompt. Swapping self-attention on downblocks produces disrupted results. We only apply swapping self-attention in the upblocks to appropriately reflect the style elements.
  • Figure 5: Analysis on the optimal range of upblocks for swapping self-attention. We find the optimal range of upblocks for a balanced trade-off between different aspects. Please refer to the section on choosing blocks for details.
  • ...and 17 more figures