OmniStyle: Filtering High Quality Style Transfer Data at Scale

Ye Wang, Ruiqi Liu, Jiang Lin, Fei Liu, Zili Yi, Yilin Wang, Rui Ma

TL;DR

OmniStyle addresses key challenges in style transfer by introducing OmniStyle-1M, a large-scale paired dataset spanning 1,000 fine-grained styles and 20 content categories, enriched with textual prompts to support supervision and controllability. It pairs this dataset with OmniFilter, a multimodal quality assessment framework that uses CLIP, DINOv2, Style30K-based contrastive learning, and InternVL2-based aesthetic scoring to select high-quality triplets for training. The authors then propose OmniStyle, an end-to-end framework based on the Diffusion Transformer that supports both instruction-guided and image-guided style transfer. It uses a VAE+MM-DiT architecture and freezes most components during training to enable efficient fine-tuning. Across quantitative, qualitative, and user studies, OmniStyle shows better style fidelity, content preservation, aesthetic appeal, and efficiency than state-of-the-art baselines, establishing a new baseline for scalable, high-quality style transfer. The work provides a dataset and a methodological blueprint for researchers aiming to scale style transfer with precise control and broad style coverage.

Abstract

In this paper, we introduce OmniStyle-1M, a large-scale paired style transfer dataset comprising over one million content-style-stylized image triplets across 1,000 diverse style categories, each enhanced with textual descriptions and instruction prompts. We show that OmniStyle-1M not only enables efficient and scalable training of style transfer models through supervised learning but also facilitates precise control over target stylization. In particular, to ensure the quality of the dataset, we introduce OmniFilter, a comprehensive style transfer quality assessment framework, which filters high-quality triplets based on content preservation, style consistency, and aesthetic appeal. Building upon this foundation, we propose OmniStyle, a framework based on the Diffusion Transformer (DiT) architecture designed for high-quality and efficient style transfer. This framework supports both instruction-guided and image-guided style transfer, generating high-resolution outputs with exceptional detail. Extensive qualitative and quantitative evaluations demonstrate OmniStyle's superior performance compared to existing approaches, highlighting its efficiency and versatility. OmniStyle-1M and its accompanying methodologies provide a significant contribution to advancing high-quality style transfer, offering a valuable resource for the research community.
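The filtering step described above can be illustrated with a minimal sketch. It assumes each triplet already carries three precomputed scores in [0, 1] and that a triplet is kept only when all three clear a threshold. The names `ScoredTriplet`, `Thresholds`, and `filter_triplets`, the threshold values, and the all-criteria-must-pass rule are hypothetical choices made for illustration; the source does not specify how OmniFilter combines its scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTriplet:
    """A content-style-stylized triplet with precomputed quality scores.

    All field names and the [0, 1] score range are illustrative assumptions.
    """
    triplet_id: str
    content_score: float    # how well the stylized image preserves content
    style_score: float      # how consistent the stylized image is with the style image
    aesthetic_score: float  # overall aesthetic appeal of the stylized image


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cut-offs; the source does not give actual values."""
    content: float = 0.5
    style: float = 0.5
    aesthetic: float = 0.5


def passes(triplet: ScoredTriplet, th: Thresholds) -> bool:
    """Keep a triplet only if every criterion meets its threshold (assumed rule)."""
    return (
        triplet.content_score >= th.content
        and triplet.style_score >= th.style
        and triplet.aesthetic_score >= th.aesthetic
    )


def filter_triplets(triplets, th=Thresholds()):
    """Return the triplets that pass all three checks, preserving input order."""
    return [t for t in triplets if passes(t, th)]


candidates = [
    ScoredTriplet("a", content_score=0.9, style_score=0.8, aesthetic_score=0.7),
    ScoredTriplet("b", content_score=0.9, style_score=0.2, aesthetic_score=0.9),
    ScoredTriplet("c", content_score=0.6, style_score=0.6, aesthetic_score=0.4),
]
kept = filter_triplets(candidates)
kept_ids = [t.triplet_id for t in kept]
print(kept_ids)
```

With the default thresholds, triplet "b" is rejected for weak style consistency and "c" for low aesthetic appeal. Requiring every criterion to pass, rather than averaging the scores, prevents a strong score on one axis from masking a failure on another, though a weighted combination would be an equally plausible design.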

Paper Structure

This paper contains 22 sections, 2 equations, 20 figures, 11 tables.

Figures (20)

  • Figure 1: OmniStyle enables high-quality (a) instruction-guided style transfer and (b) reference image-guided style transfer, covering a diverse range of styles, including but not limited to comics, vector art, oil painting, sketch, and Chinese ancient art. Note that in (a), a style image matching each style description is shown for illustration only; our method takes only a text instruction and a content image as input. In (b), results are generated in the traditional style transfer manner, in which the model takes both the content and style images as input.
  • Figure 2: Overview of OmniStyle-1M. (a) The inner ring represents the eight primary categories, while the outer ring corresponds to specific fine-grained classifications, illustrating the extensive diversity of style categories within our dataset. (b) Two example triplets are shown, each including a content image, a style image, a stylized output, a corresponding textual description, and an instructional prompt. (c) Distribution of stylized results across different content categories.
  • Figure 3: Overview of our dataset creation and filtering pipeline. (a) Content Image Generation: We utilize ChatGPT to automatically generate textual descriptions across 20 categories (e.g., animals, architecture, humans, food) and generate corresponding images using the FLUX model. (b) Style Transfer: Style images are randomly sampled from the Style30K dataset, and six SOTA style transfer models are applied to generate a large and diverse dataset of one million triplets. (c) OmniFilter: Stylized images are filtered based on content consistency, style preservation, and aesthetic appeal to ensure high-quality, visually cohesive results.
  • Figure 4: The architecture of OmniStyle.
  • Figure 5: Qualitative comparison with other state-of-the-art methods for the instruction-guided style transfer task. For clarity, the style images and style categories are placed on the right side of the content images for reference.
  • ...and 15 more figures