StyleShot: A Snapshot on Any Style

Junyao Gao, Yanchen Liu, Yanan Sun, Yinhao Tang, Yanhong Zeng, Kai Chen, Cairong Zhao

TL;DR

StyleShot addresses the challenge of generalized, test-time tuning-free style transfer by introducing a dedicated style-aware encoder that leverages multi-scale patch embeddings and Mixture-of-Experts, paired with a content-fusion encoder to decouple content from style. A style-balanced StyleGallery and a new StyleBench benchmark enable robust learning and evaluation across open-domain styles, from 3D and flat to fine-grained textures. The approach achieves state-of-the-art performance in both text-driven and image-driven stylization without test-time style tuning, demonstrated through qualitative visuals, human preferences, and CLIP-based metrics. These contributions provide a practical, scalable pathway for flexible, high-fidelity style transfer on diffusion-based image generators. The work also emphasizes the importance of balanced training data and explicit decoupling of content and style for generalization.

Abstract

In this paper, we show that a good style representation is crucial and sufficient for generalized style transfer without test-time tuning. We achieve this by constructing a style-aware encoder and a well-organized style dataset called StyleGallery. Designed specifically for style learning, the style-aware encoder is trained with a decoupling training strategy to extract expressive style representations, and StyleGallery provides the generalization ability. We further employ a content-fusion encoder to enhance image-driven style transfer. Our approach, named StyleShot, is simple yet effective at mimicking a variety of desired styles, e.g., 3D, flat, abstract, or even fine-grained styles, without test-time tuning. Rigorous experiments validate that StyleShot achieves superior performance across a wide range of styles compared to existing state-of-the-art methods. The project page is available at: https://styleshot.github.io/.
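
The TL;DR mentions that the style-aware encoder works on multi-scale patch embeddings. As a rough illustration of the patch-partitioning idea only, the sketch below splits a toy image into non-overlapping patches at several sizes. The function names, the grayscale list-of-lists image, and the patch sizes (2, 4, 8) are all assumptions made for this example and are not taken from the paper; the embedding and Mixture-of-Experts stages are not shown.

```python
from typing import Dict, List, Sequence

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def extract_patches(image: Image, patch_size: int) -> List[Image]:
    """Split an image into non-overlapping square patches, dropping any ragged border."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


def multi_scale_patches(image: Image, patch_sizes: Sequence[int] = (2, 4, 8)) -> Dict[int, List[Image]]:
    """Return {patch_size: [patches]} for each requested scale (sizes are hypothetical)."""
    return {size: extract_patches(image, size) for size in patch_sizes}


# Toy 8x8 "image"; real inputs would be RGB style reference images.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
scales = multi_scale_patches(image)
counts = {size: len(patches) for size, patches in scales.items()}
print(counts)
```

Smaller patches isolate local cues such as texture and color, while larger patches retain more of the layout; in a full model, each scale's patches would then be embedded and passed to the encoder.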

Paper Structure (26 sections, 8 equations, 63 figures, 6 tables)

Figures (63)

  • Figure 1: Visualization results of StyleShot for text-driven and image-driven style transfer across six style reference images. Each stylized image is generated by StyleShot without test-time style tuning and captures nuances such as colors, textures, illumination, and layout.
  • Figure 2: Comparison of style extraction by the CLIP image encoder (a) and our style-aware encoder (b).
  • Figure 3: The overall architecture of our proposed StyleShot.
  • Figure 4: Attention map from the CLIP image encoder on style reference images.
  • Figure 5: Illustration of the content input under different settings.
  • ...and 58 more figures