StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

Wen Li, Muyuan Fang, Cheng Zou, Biao Gong, Ruobing Zheng, Meng Wang, Jingdong Chen, Ming Yang

TL;DR

StyleTokenizer tackles precise style control in diffusion-based image generation by aligning style representations with text embeddings in a shared semantic space. It introduces a two-stage framework: a Style Encoder trained on Style30K extracts style cues as a latent embedding $f_s$, and a Style Tokenizer $T_s$ maps $f_s$ to style tokens $e_s$ aligned with the text tokens $e_t$, enabling joint conditioning in Stable Diffusion with independent text and style guidance. Together, the Style30K dataset, the Style Encoder, and the Style Tokenizer enable zero-shot style control from a single reference image while preserving text-prompt effectiveness, as shown in qualitative, quantitative, and ablation studies. Once the encoder and tokenizer are trained, a new style needs no per-style fine-tuning, only one reference image. Code and data are publicly released to support further research and applications.
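
The data flow described above ($f_s \rightarrow T_s \rightarrow e_s$, then joint conditioning with $e_t$) can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: the two-layer MLP form of the tokenizer, the dimensions, the random weights, and the helper names (`style_tokenizer`, `build_condition`) are hypothetical and are not taken from the released code. It only shows the shape bookkeeping of turning one style embedding into a few tokens that share the text tokens' width and are appended to the text sequence.

```python
import random

# Toy sizes, chosen only for illustration (not the paper's values).
STYLE_DIM = 8          # width of the style embedding f_s
HIDDEN_DIM = 16        # hidden width of the hypothetical tokenizer MLP
TOKEN_DIM = 4          # width of one word-embedding token
NUM_STYLE_TOKENS = 2   # how many style tokens e_s the tokenizer emits


def init_linear(in_dim, out_dim, rng):
    """Create a small random weight matrix (list of rows) and a zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def style_tokenizer(f_s, params):
    """Hypothetical T_s: map f_s to NUM_STYLE_TOKENS tokens of width TOKEN_DIM."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(f_s, w1, b1)]  # ReLU activation
    flat = linear(hidden, w2, b2)
    # Split the flat output into equally sized tokens.
    return [flat[i * TOKEN_DIM:(i + 1) * TOKEN_DIM] for i in range(NUM_STYLE_TOKENS)]


def build_condition(e_t, e_s):
    """Concatenate text tokens and style tokens along the sequence axis."""
    return e_t + e_s


rng = random.Random(0)
params = (
    init_linear(STYLE_DIM, HIDDEN_DIM, rng),
    init_linear(HIDDEN_DIM, NUM_STYLE_TOKENS * TOKEN_DIM, rng),
)

f_s = [rng.uniform(-1.0, 1.0) for _ in range(STYLE_DIM)]                      # stand-in style embedding
e_t = [[rng.uniform(-1.0, 1.0) for _ in range(TOKEN_DIM)] for _ in range(3)]  # stand-in text tokens

e_s = style_tokenizer(f_s, params)
condition = build_condition(e_t, e_s)
print(len(condition), "tokens of width", len(condition[0]))
```

Because the style tokens have the same width as the text tokens, the combined sequence can be handed to a text-conditioned denoiser without a separate image-conditioning branch, which is the contrast with adapter-based methods that the summary draws.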

Abstract

Despite the burst of innovative methods for controlling the diffusion process, effectively controlling image styles in text-to-image generation remains a challenging task. Many adapter-based methods impose image representation conditions on the denoising process to accomplish image control. However, these conditions are not aligned with the word embedding space, leading to interference between image and text control conditions and the potential loss of semantic information from the text prompt. Addressing this issue involves two key challenges: first, how to inject the style representation without compromising the effectiveness of the text representation in control; second, how to obtain an accurate style representation from a single reference image. To tackle these challenges, we introduce StyleTokenizer, a zero-shot style control image generation method that aligns style representation with text representation using a style tokenizer. This alignment effectively minimizes the impact on the effectiveness of text prompts. Furthermore, we collect a well-labeled style dataset named Style30k to train a style feature extractor capable of accurately representing style while excluding other content information. Experimental results demonstrate that our method fully grasps the style characteristics of the reference image, generating appealing images that are consistent with both the target image style and text prompt. The code and dataset are available at https://github.com/alipay/style-tokenizer.

Paper Structure (17 sections, 2 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Some showcases of StyleTokenizer. It is capable of generating images with corresponding styles using a single style image reference. For each image pair, the smaller one is a style reference, and the larger one is a generated image conditioned by the corresponding style reference and text prompt on the bottom.
  • Figure 2: The difference between StyleTokenizer and adapter-based methods. Left: Adapter-based methods inject the image representation directly, in a manner similar to the text representation, leading to interference between the two control conditions and loss of semantic information from the text prompt. Right: StyleTokenizer aligns the style representation with the text representation in a common semantic space, which minimizes the impact on the effectiveness of text prompts.
  • Figure 3: Overview of StyleTokenizer. Our method consists of two stages. In the first stage, a Style Encoder is trained on a style dataset to acquire style representation capabilities. We employ contrastive learning to make it focus on the distances between diverse styles for better style representation. In the second stage, the Style Encoder extracts a style embedding from a single image, and a Style Tokenizer converts it into style tokens, which are aligned with text tokens in the word embedding space. Finally, these tokens are input to the SD pipeline as a condition to generate the image.
  • Figure 4: A sample of style images from the Style30k dataset. Each dotted box represents a style category. Style30K is a style-focused dataset with over 300 style categories and 30,000 images, all professionally annotated. The number of images in each category ranges from 30 to 200. These categories cover a diverse range of fields, including art styles, commercial design styles, 3D modeling, etc.
  • Figure 5: Visual comparison with the competing methods. Each column shows the results of the different methods for the same prompt and reference image. Each row shows the results of one method. Images in the first row are the style references used to control style.
  • ...and 3 more figures
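
The Figure 3 caption says the Style Encoder is trained with contrastive learning over labeled style categories, but the loss itself is not given in this summary. As a rough illustration only, the sketch below uses a generic supervised contrastive loss (pull same-category embeddings together, push other categories apart). That choice of loss, the temperature value, and the function names are assumptions, not the authors' formulation.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over a small batch.

    For each anchor, samples sharing its style label are positives and all
    other samples in the batch act as negatives.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-style partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits.values())
        )
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy 2-D "style embeddings": two tight clusters.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched_labels = [0, 0, 1, 1]      # labels agree with the clusters
mismatched_labels = [0, 1, 0, 1]   # labels cut across the clusters

matched_loss = supervised_contrastive_loss(toy_embeddings, matched_labels)
mismatched_loss = supervised_contrastive_loss(toy_embeddings, mismatched_labels)
print(matched_loss < mismatched_loss)
```

The loss is small when embeddings of the same style category already sit close together and large when they do not, which is the behavior a style encoder would be optimized toward under this kind of objective.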