StyleBrush: Style Extraction and Transfer from a Single Image

Wancheng Feng, Wanquan Feng, Dawei Huang, Jiaming Pei, Guangliang Cheng, Lukun Wang

TL;DR

StyleBrush tackles reference-image-guided stylization with a diffusion-based, two-branch framework that separates style from content structure. A ReferenceNet extracts style while a Structure Guider preserves geometry; the framework uses grayscale, blurred inputs and cross-attention within a Stable Diffusion backbone. It extends to video via AnimateDiff and exposes a controllable style-strength parameter. The authors construct a dataset of 100K high-quality style images from LLM-generated prompts and Kolors, which enables training on crops of a single image without per-style optimization. Qualitative, quantitative, and user-based evaluations show state-of-the-art performance and strong video consistency, making the method practical for image and video stylization with flexible strength control.

Abstract

Stylization for visual content aims to add specific style patterns at the pixel level while preserving the original structural features. Compared with using predefined styles, stylization guided by reference style images is more challenging, and the main difficulty is to effectively separate style from structural elements. In this paper, we propose StyleBrush, a method that accurately captures styles from a reference image and "brushes" the extracted style onto other input visual content. Specifically, our architecture consists of two branches: ReferenceNet, which extracts style from the reference image, and Structure Guider, which extracts structural features from the input image, thus enabling image-guided stylization. We use LLM and T2I models to create a dataset of 100K high-quality style images, covering a diverse range of styles and contents with high aesthetic scores. To construct training pairs, we crop different regions of the same training image. Qualitative and quantitative experiments show that our approach achieves state-of-the-art results. We will release our code and dataset upon acceptance of the paper.
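
The pair construction described above (two crops of the same training image, with the content crop reduced to a grayscale, blurred structural image) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the crop size, the Rec. 601 luma weights, and the box-blur kernel are all hypothetical choices, and the use of the unmodified content crop as the denoising target is an assumption as well.

```python
import numpy as np


def random_crop(img: np.ndarray, size: int, rng: np.random.Generator) -> np.ndarray:
    """Return a random size x size crop of an H x W x C image."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]


def to_structure_image(img: np.ndarray, blur_radius: int = 2) -> np.ndarray:
    """Convert an RGB image to a blurred grayscale map (H x W, float32).

    Assumption: luma weights and a separable box blur stand in for
    whatever grayscale and blur operators the paper actually uses.
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = img[..., :3].astype(np.float32) @ weights

    k = 2 * blur_radius + 1
    kernel = np.ones(k, dtype=np.float32) / k
    padded = np.pad(gray, blur_radius, mode="edge")

    # Separable blur: filter every row, then every column.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )
    return blurred.astype(np.float32)


def make_training_pair(img: np.ndarray, size: int, rng: np.random.Generator):
    """Build (structure, reference, target) from one training image.

    Assumption: the content crop doubles as the reconstruction target,
    and the reference crop is an independent crop of the same image.
    """
    content = random_crop(img, size, rng)
    reference = random_crop(img, size, rng)
    structure = to_structure_image(content)
    return structure, reference, content


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
    structure, reference, target = make_training_pair(image, 32, rng)
    print(structure.shape, reference.shape, target.shape)
```

Because both crops come from the same image, they share a style while generally differing in content, which is what lets a single image supply both the style reference and the structural input without a per-style optimization step.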

Paper Structure (20 sections, 3 equations, 6 figures, 1 table)

This paper contains 20 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: We propose StyleBrush, a framework that transfers style from only a single reference style image. The gallery above shows our results on various styles and diverse content images. The 1st and 4th rows are the content images, the 2nd and 5th rows show the styles, and the 3rd and 6th rows are the results.
  • Figure 2: Overview of the StyleBrush pipeline. The Structure Guider processes the content image, while the reference image is handled by the CLIP encoder and VAE encoder modules. Noise is added to the content image, which is then passed through the Denoising UNet along with features from the ReferenceNet. ReferenceNet integrates style and CLIP features. Finally, the combined features in the Denoising UNet are passed to the VAE decoder to generate the output image, effectively blending the structure and style of the input images.
  • Figure 3: StyleBrush training process. The images are randomly cropped to generate both content and reference images. The content image is then transformed into a structural image and input into the trainable Structure Guider module. The reference image is processed separately using frozen CLIP and VAE models. The StyleBrush denoising model combines these inputs to produce the final denoised image. Modules marked with a flame icon are trainable, while those marked with a snowflake icon remain frozen during training.
  • Figure 4: Qualitative comparison between our method and previous approaches, including CAP-VSTNET, IEContraAST, StyleFormer, StyleID, StyleTR2, InstantStyle, and IP-Adapter. As shown in the figure, our method achieves a satisfactory stylization effect while preserving the structure well.
  • Figure 5: Results for style strength values in the range $[0, 1]$. When the style strength is close to 0, the result is very similar to the content image; when the style strength is close to 1, the style of the result becomes consistent with the reference image.
  • ...and 1 more figures