Mixture of Style Experts for Diverse Image Stylization

Shihao Zhu; Ziheng Ouyang; Yijia Kang; Qilong Wang; Mi Zhou; Bo Li; Ming-Ming Cheng; Qibin Hou

Abstract

Diffusion-based stylization has advanced significantly, yet existing methods are limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework based on the Mixture of Experts (MoE). Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding then conditions a similarity-aware gating mechanism, which dynamically routes styles to specialized experts within the MoE architecture. With this architecture, our method handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles. Our code and collected images are available at the project page: https://hh-lg.github.io/StyleExpert-Page/.
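To make the routing idea in the abstract concrete, the following is a minimal sketch of similarity-based gating over a set of experts. It is not the authors' implementation: the learned per-expert key vectors (`expert_keys`), the use of cosine similarity, the top-k selection, and the `temperature` value are all assumptions introduced here for illustration, and the experts are stand-in functions rather than LoRA adapters.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def route(style_embedding, expert_keys, top_k=2, temperature=0.1):
    """Return {expert_index: weight} for the top_k most similar experts.

    expert_keys is a hypothetical list of per-expert key vectors; the
    weights are a softmax over the selected similarities and sum to 1.
    """
    sims = [cosine(style_embedding, key) for key in expert_keys]
    chosen = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
    peak = max(sims[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp((sims[i] - peak) / temperature) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def mix_experts(x, experts, weights):
    """Weighted sum of the selected experts' outputs for input vector x."""
    out = [0.0] * len(x)
    for index, weight in weights.items():
        expert_out = experts[index](x)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy usage: three experts, a style embedding closest to expert 0.
expert_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-v for v in x],
    lambda x: [v + 1.0 for v in x],
]
weights = route([1.0, 0.1], expert_keys, top_k=2)
mixed = mix_experts([1.0, 1.0], experts, weights)
```

In this toy setup the style embedding is routed to experts 0 and 2, with expert 0 receiving the larger weight because its key is more similar to the embedding.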

Paper Structure (22 sections, 14 equations, 15 figures, 5 tables)

Introduction
Related Work
Method
Preliminaries
Style Representation Encoder
Efficient MoE Fine-tuning for Style Transfer
Stylized Dataset Curation
Experiments
Experiment Settings
Qualitative Comparisons
Quantitative Comparisons
Ablation Study
Discussions and Conclusions
Qwen Semantic Score Details
Experiments Details
...and 7 more sections

Figures (15)

Figure 1: Overview of the proposed StyleExpert. Our method comprises two training stages. In the first stage (see Style Representation Encoder), a style encoder is trained with the InfoNCE loss to learn discriminative style representations, thereby accelerating convergence. In the second stage (see Efficient MoE Fine-tuning for Style Transfer), the pre-trained encoder provides style priors to guide the router network in training MoE LoRA adapters, enabling each layer to dynamically select the most suitable experts for diverse styles.
Figure 2: Overview of StyleExpert-500K. (a) illustrates the hierarchical distribution of all 209 styles in StyleExpert-500K, where bar heights indicate the number of styles per category. (b) presents examples of nine different stylizations for the same content from our StyleExpert-500K dataset. (c) compares the focus on semantic stylization between StyleExpert-500K and OmniStyle-150K.
Figure 3: Overview of our dataset curation pipeline. We first manually collect content images and style LoRAs from the web. For each content image, we use Qwen [qwen] to generate a clean caption that excludes color and style information. The content image, its caption, and the corresponding style LoRA are then fed into the model to generate stylized images. A vision-language model (VLM) filters out those with incorrect or failed stylization. Finally, we compute CLIP [radford2021learning] similarity to select the most suitable style reference for each target, forming the final triplet dataset.
Figure 4: Qualitative comparison of our method, StyleExpert, with other SOTA style transfer methods on unseen styles.
Figure 5: Impact of the style encoder on MoE Training.
...and 10 more figures
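
The Figure 1 caption states that the style encoder is trained with the InfoNCE loss. As a rough illustration of that loss only, the sketch below computes InfoNCE for a single anchor embedding. How the paper builds positive and negative pairs is not described in this excerpt, so treating "same style" embeddings as positives and "other style" embeddings as negatives is an assumption, as are the cosine similarity and the `temperature` value.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for one anchor.

    loss = -log( exp(s_pos / t) / sum_j exp(s_j / t) ),
    where the sum runs over the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # log-sum-exp with max subtraction for stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy usage: the loss is small when the positive is the closest embedding
# and large when a negative is closer to the anchor than the positive.
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing this quantity pulls the anchor toward its positive and pushes it away from the negatives, which is the sense in which the caption calls the learned representations discriminative.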
