
StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References

Boyu He, Yunfan Ye, Chang Liu, Weishang Wu, Fang Liu, Zhiping Cai

TL;DR

StyleGallery is a training-free, semantic-aware style transfer framework that accepts arbitrary reference images as input and enables personalized customization. On the authors' benchmark, it outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when multiple style references are used.

Abstract

Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.
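The first stage (semantic region segmentation) can be illustrated with a minimal sketch. It assumes per-timestep UNet features are already available as arrays of shape (H, W, C). The function names, mixing weights, number of PCA components, and fixed cluster count `k` are all hypothetical, and the adaptive clustering described in the abstract is replaced here by plain K-means with a deterministic farthest-point initialization.

```python
import numpy as np

def mix_features(feats, weights):
    """Weighted average of per-timestep feature maps, each of shape (H, W, C)."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                                   # normalize the (assumed) weights
    return np.tensordot(w, np.stack(feats), axes=1)   # result has shape (H, W, C)

def pca_reduce(x, n_components):
    """Project the rows of x (N, C) onto the top principal components via SVD."""
    xc = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:n_components].T

def kmeans(x, k, n_iter=50):
    """Plain K-means with farthest-point initialization (an illustrative choice)."""
    centers = [x[0]]
    for _ in range(1, k):
        # squared distance from every point to its nearest chosen center
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def segment(feats, weights, k=2, n_components=3):
    """Return an (H, W) integer mask of cluster labels for the mixed features."""
    f_mix = mix_features(feats, weights)
    h, w, c = f_mix.shape
    reduced = pca_reduce(f_mix.reshape(h * w, c), n_components)
    return kmeans(reduced, k).reshape(h, w)
```

For instance, calling `segment(feats, weights=[0.2, 0.3, 0.5], k=2)` on three feature maps whose left and right halves carry different feature vectors yields a mask that separates the two halves. No semantic mask or other extra input is needed, which is the property the abstract highlights.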

Paper Structure (30 sections, 11 equations, 22 figures, 7 tables)

This paper contains 30 sections, 11 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Overall composition of the style dataset.
  • Figure 2: Overall framework. Our pipeline comprises three stages: (a) In stage 1, the content image is diffused for $T$ steps to extract UNet features $\{F_0,\dots,F_T\}$, which are weighted into $F_{mix}$ and then clustered via PCA and K-means to produce a semantic mask that is subsequently optimized. (b) In stage 2, $F_{mix}$ and the semantic mask identify cluster features, which are aggregated via self-attention for statistical similarity. Meanwhile, DINOv2 (Oquab et al., 2023) splits $x_0$ into blocks; tokens are filtered for semantic similarity, while cluster positions yield positional similarity. (c) Stage 3 optimizes the generation through $N$ latent sampling steps. UNet attention maps are extracted, sparsified using the semantic mask, and recombined with style features ($K_s$, $V_s$). L1 distances are then computed between the actual and combined feature maps, as well as between $Q$ and $Q_c$. These losses guide the final image generation.
  • Figure 2: Category distribution in style dataset.
  • Figure 3: Cluster optimization. We compute pairwise semantic distances among clusters and merge those below a threshold (set to 0.85). We then split and recombine clusters guided by the input's depth features, and finally traverse each pixel to eliminate isolated points.
  • Figure 3: Ablations on cluster merge threshold.
  • ...and 17 more figures
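
Two of the cluster-optimization steps in the Figure 3 caption (threshold-based merging and isolated-point removal) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the caption does not specify the distance metric, so cosine distance between cluster centroids is assumed; "isolated" is taken to mean a pixel whose label appears in none of its 4-neighbours; the depth-guided split and recombine step is omitted; and all function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (an assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def merge_clusters(centroids, threshold=0.85):
    """Map each cluster label to a merged label using union-find."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine_distance(centroids[i], centroids[j]) < threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)  # keep the smaller label
    return [find(i) for i in range(len(centroids))]

def remove_isolated(mask):
    """Relabel pixels whose label is absent from all of their 4-neighbours."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            nbrs = [
                mask[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < h and 0 <= nx < w
            ]
            if nbrs and mask[y][x] not in nbrs:
                # adopt the most frequent neighbouring label
                out[y][x] = max(set(nbrs), key=nbrs.count)
    return out
```

A typical use is to compute `mapping = merge_clusters(centroids)`, relabel the mask with `[[mapping[l] for l in row] for row in mask]`, and then pass the result through `remove_isolated`.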