StyleMaster: Stylize Your Video with Artistic Generation and Translation

Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, Wenhan Luo

TL;DR

StyleMaster addresses the problem of transferring a reference style to videos while preserving local texture and avoiding content leakage. It introduces a style extractor that produces two descriptors, a global descriptor $F_{global}$ obtained via a projection applied after CLIP and a texture descriptor $F_{texture}$ selected from CLIP patches, and fuses them through dual cross-attention. A motion adapter with a LoRA-based update $\widetilde{W} = W + \alpha A^{down} A^{up}$ supports temporally consistent stylization, and a grayscale tile ControlNet provides precise content guidance. Experiments on image style transfer, stylized video generation, and video style transfer report state-of-the-art results on style similarity, text alignment, and motion quality.
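The LoRA-style update $\widetilde{W} = W + \alpha A^{down} A^{up}$ can be illustrated with a minimal sketch. This is not the authors' code: the function names `matmul` and `lora_update`, the matrix shapes, and the toy values are all assumptions chosen for illustration, and plain Python lists stand in for real weight tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_update(w, a_down, a_up, alpha):
    """Return W + alpha * (A_down @ A_up) without modifying W.

    Assumed shapes (hypothetical): w is (d_in, d_out),
    a_down is (d_in, r), a_up is (r, d_out), with a small rank r.
    """
    delta = matmul(a_down, a_up)
    return [[w_ij + alpha * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example: 2x2 base weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
a_down = [[1.0], [2.0]]   # (2, 1)
a_up = [[3.0, 4.0]]       # (1, 2)

unchanged = lora_update(w, a_down, a_up, alpha=0.0)  # equals w
adapted = lora_update(w, a_down, a_up, alpha=0.5)
print(adapted)  # [[2.5, 2.0], [3.0, 5.0]]
```

In this sketch, $\alpha$ only scales the low-rank term: setting it to zero recovers the base weight $W$, and larger magnitudes move the weight further along the direction learned by the adapter.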

Abstract

Style control has become popular in video generation models. Existing methods often generate videos that are far from the given style, cause content leakage, and struggle to transfer a video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. To capture texture features while preventing content leakage, we filter out content-related patches and retain style-related ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances absolute style consistency. Moreover, to bridge the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances the extent of stylization and enables our image-trained model to be applied seamlessly to videos. Building on these components, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also generalizes easily to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, generating high-quality stylized videos that align with the textual content and closely resemble the style of the reference images. Our project page is at https://zixuan-ye.github.io/stylemaster
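The patch filtering described in the abstract (drop content-related patches, keep style-related ones, based on prompt-patch similarity) might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation: cosine similarity as the similarity measure, the `keep_ratio` parameter, the function names, and the tiny 2-D vectors standing in for CLIP features are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_texture_patches(patch_feats, text_emb, keep_ratio=0.5):
    """Keep the patches LEAST similar to the text prompt.

    Patches that resemble the prompt are assumed to carry content,
    so they are dropped; the rest serve as texture guidance.
    keep_ratio is a hypothetical knob, not a value from the paper.
    Returns (kept indices in original order, kept features).
    """
    sims = [cosine(p, text_emb) for p in patch_feats]
    k = max(1, int(len(patch_feats) * keep_ratio))
    lowest_first = sorted(range(len(patch_feats)), key=lambda i: sims[i])
    kept = sorted(lowest_first[:k])
    return kept, [patch_feats[i] for i in kept]


# Toy features: patches 0 and 2 point in the prompt's direction.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.2]]
prompt = [1.0, 0.0]
kept_idx, kept_feats = select_texture_patches(patches, prompt)
print(kept_idx)  # [1, 3]
```

Ranking by similarity rather than thresholding keeps the number of texture tokens fixed regardless of the prompt, which is one plausible design choice; a fixed threshold would be an equally valid variant of this sketch.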

Paper Structure

This paper contains 26 sections, 8 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Our StyleMaster demonstrates superior video style transfer and stylized generation. The top section shows our method effectively applying various styles to videos, outperforming VideoComposer [wang2024videocomposer] and the combination of InstantStyle [instantstyle] with AnyV2V [anyv2v]. The bottom section highlights our high-quality text-driven stylized synthesis, surpassing VideoComposer [wang2024videocomposer] and StyleCrafter [stylecrafter].
  • Figure 2: Existing image and video stylization methods either fail to preserve local texture or suffer from content leakage. Note: * indicates that StyleCrafter does not support style transfer, so we use the text prompt and the reference style image to generate its results.
  • Figure 3: Comparison of Style30K with our dataset generated by model illusion. Style30K cannot ensure consistency within a style group (highlighted by the same color), whereas ours has absolute consistency.
  • Figure 4: The pipeline of our proposed StyleMaster. We first obtain the patch features and the image embedding of the style image from CLIP. We then select the patches that share less similarity with the text prompt as texture guidance, and use a global projection module to transform the image embedding into global style descriptions. The global projection module is trained by contrastive learning on a paired dataset constructed through model illusion. The style information is then injected into the model through decoupled cross-attention. The motion adapter and the gray tile ControlNet are used to enhance dynamic quality and to enable content control, respectively.
  • Figure 5: Similarity between the extracted global style representations among image patches. Without our global projection, the CLIP image embedding attends only to specific regions, whereas after the projection the attention is evenly distributed.
  • ...and 9 more figures