SigStyle: Signature Style Transfer via Personalized Text-to-Image Models

Ye Wang, Tongyuan Bai, Xuping Xie, Zili Yi, Yilin Wang, Rui Ma

TL;DR

SigStyle tackles signature style transfer from a single reference by learning a dedicated style representation through a hypernetwork that fine-tunes only decoder-attention weights in a personalized diffusion framework. It represents the style as a token (*) and preserves content by performing DDIM inversion on the content image and injecting content-attention priors during the first $k$ denoising steps. The approach yields high-quality global and local transfers, supports texture transfer and style fusion, and enables style-guided text-to-image generation, outperforming several state-of-the-art baselines in both qualitative and quantitative assessments. Overall, SigStyle offers an effective, single-image, parameter-efficient pathway for explicit, controllable preservation of signature-style attributes in diffusion-based synthesis, with potential for broader deployment and more controllable prompts.
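For reference, the DDIM inversion mentioned above is commonly written as the deterministic DDIM update run in the noising direction. The notation here is an assumption borrowed from the standard DDIM formulation and is not defined in this summary: $z_t$ is the latent at step $t$, $\bar{\alpha}_t$ the cumulative noise schedule, and $\epsilon_\theta$ the noise predictor. Whether SigStyle uses exactly this form is not stated here.

$$
\hat{z}_0 = \frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}}, \qquad
z_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\,\hat{z}_0 + \sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon_\theta(z_t, t)
$$

Applying the second equation for $t = 0, \dots, T-1$ maps the encoded content image to a noise latent $z_T$, which can then serve as the starting point for generating the target image.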

Abstract

Style transfer enables the seamless integration of artistic styles from a style image into a content image, resulting in visually striking and aesthetically enriched outputs. Despite numerous advances in this field, existing methods have not explicitly focused on the signature style, which represents the distinct and recognizable visual traits of an image, such as geometric and structural patterns, color palettes, and brush strokes. In this paper, we introduce SigStyle, a framework that leverages the semantic priors embedded in a personalized text-to-image diffusion model to capture the signature style representation. This style capture process is powered by a hypernetwork that efficiently fine-tunes the diffusion model for any given single style image. Style transfer is then conceptualized as the reconstruction of the content image through learned style tokens from the personalized diffusion model. Additionally, to ensure content consistency throughout the style transfer process, we introduce a time-aware attention swapping technique that incorporates content information from the original image into the early denoising steps of target image generation. Beyond enabling high-quality signature style transfer across a wide range of styles, SigStyle supports multiple interesting applications, such as local style transfer, texture transfer, style fusion, and style-guided text-to-image generation. Quantitative and qualitative evaluations demonstrate that our approach outperforms existing style transfer methods in recognizing and transferring signature styles.

Paper Structure

This paper contains 27 sections, 4 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Our method achieves high-quality global style transfer (a) while preserving the signature style, i.e., distinct and recognizable visual traits such as geometric and structural patterns, color palettes, and brush strokes. The method is also flexible and supports local style transfer (b), style-guided text-to-image generation (c), and texture transfer (d). Best viewed in color.
  • Figure 2: Signature style transfer comparison with SOTA methods on two complex style references.
  • Figure 3: The SigStyle framework. First, given a style image, we perform hypernetwork-powered style-aware fine-tuning for style inversion and represent the reference style as a special token * (see Figure 3a). In Figure 3b, the upper branch represents the reconstruction process of the content image, while the lower branch represents the generation process of the target image. When generating the target image using a pre-trained model and target text, we first use DDIM inversion to map the content image into noise latents, which are then copied as the initial noise for generating the target image. Then, we adopt time-aware attention swapping to inject structural and content information during the first $k$ steps of the denoising process (see Figure 3b). In the subsequent $T-k$ steps, we proceed with the usual denoising process without any swapping. Finally, by decoding with the VAE, we obtain the style-transferred image.
  • Figure 4: Style learning preferences analysis of UNet's encoder and decoder.
  • Figure 5: The architecture of hypernetwork.
  • ...and 7 more figures
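
The schedule described in the Figure 3 caption (swap during the first $k$ of $T$ denoising steps, then denoise normally) can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `should_swap`, `generate_with_swapping`, and `denoise_step` are hypothetical, the cached content features are plain lists standing in for attention tensors, and which attention layers or maps are actually swapped is not specified in this summary. DDIM inversion and VAE decoding sit outside the sketch.

```python
from typing import Callable, List, Optional, Sequence

Latent = List[float]
AttnCache = List[float]  # stand-in for attention features cached at one step


def should_swap(step_index: int, k: int) -> bool:
    """Return True for the first k denoising steps (0-based), else False."""
    return step_index < k


def generate_with_swapping(
    init_latent: Latent,
    content_attn: Sequence[AttnCache],
    denoise_step: Callable[[Latent, int, Optional[AttnCache]], Latent],
    total_steps: int,
    k: int,
) -> Latent:
    """Run total_steps denoising steps, injecting content features early.

    init_latent   : noise latent copied from the inverted content image.
    content_attn  : features cached while reconstructing the content image,
                    one entry per step (only the first k are read).
    denoise_step  : hypothetical callable wrapping one model step; it gets
                    the injected features, or None when no swap happens.
    """
    if not 0 <= k <= total_steps:
        raise ValueError("k must satisfy 0 <= k <= total_steps")
    if len(content_attn) < k:
        raise ValueError("need cached content features for the first k steps")

    latent = list(init_latent)
    for i in range(total_steps):
        # Steps 0..k-1 receive content features; the remaining T-k do not.
        injected = content_attn[i] if should_swap(i, k) else None
        latent = denoise_step(latent, i, injected)
    return latent


def toy_step(latent: Latent, i: int, injected: Optional[AttnCache]) -> Latent:
    """Toy stand-in for a model step, used only to exercise the loop."""
    if injected is not None:
        # Pull the latent halfway toward the injected content features.
        return [0.5 * x + 0.5 * c for x, c in zip(latent, injected)]
    # Plain step without swapping: a simple contraction.
    return [0.5 * x for x in latent]


if __name__ == "__main__":
    cache = [[1.0, 1.0]] * 4
    result = generate_with_swapping([0.0, 2.0], cache, toy_step, total_steps=4, k=2)
    print(result)
```

In this framing, $k$ is the single knob that trades content fidelity against stylization: a larger $k$ injects content information for more steps, while $k = 0$ reduces to ordinary denoising from the inverted latent.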