StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians

Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li

TL;DR

StyleMe3D presents a post-training framework for stylizing 3D Gaussian Splatting (3D GS) by exclusively optimizing color attributes while preserving geometry. It introduces four encoders—Dynamic Style Score Distillation (DSSD), Simultaneously Optimized Scale (SOS), Contrastive Style Descriptor (CSD), and 3D Gaussian Quality Assessment (3DG-QA)—to achieve semantic-aware, texture-rich, and aesthetically coherent 3D stylization, leveraging Stable Diffusion latent space for semantic guidance. The approach integrates multi-scale texture alignment, mid-level style descriptors, and perceptual quality priors into a single objective, $\mathcal{L}_{final} = \lambda_{1} \mathcal{L}_{style} + \lambda_{2} \mathcal{L}_{SOS} + \lambda_{3} \mathcal{L}_{CSD} + \lambda_{4} \mathcal{L}_{3DG-QA}$, and demonstrates superior preservation of geometry (e.g., carvings) and cross-scene stylistic consistency on NeRF and tandt db datasets with real-time rendering. By bridging photorealistic 3D GS and artistic stylization, StyleMe3D enables applications in gaming, virtual worlds, and digital art with robust, semantically guided style transfer. The work highlights the importance of decoupled geometry, multi-level semantics, and perceptual quality priors in achieving high-fidelity 3D style transfer at interactive speeds.
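As a minimal sketch of how the four terms combine into the single objective above, the following assumes each loss has already been computed as a scalar. The function name `combine_losses`, the dictionary keys, and all numeric values are illustrative assumptions, not settings from the paper.

```python
from typing import Mapping

# Hypothetical term names mirroring L_style, L_SOS, L_CSD and L_3DG-QA.
TERMS = ("style", "sos", "csd", "qa")


def combine_losses(losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the weighted sum lambda_1*L_style + ... + lambda_4*L_3DG-QA."""
    missing = [t for t in TERMS if t not in losses or t not in weights]
    if missing:
        raise KeyError(f"missing loss or weight for: {missing}")
    return sum(weights[t] * losses[t] for t in TERMS)


# Placeholder values for illustration only (not taken from the paper).
example_losses = {"style": 2.0, "sos": 1.0, "csd": 0.5, "qa": 4.0}
example_weights = {"style": 1.0, "sos": 0.5, "csd": 2.0, "qa": 0.25}
total = combine_losses(example_losses, example_weights)
print(total)
```

Keeping the weights in a separate mapping makes it straightforward to ablate a term by setting its weight to zero.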

Abstract

3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on the NeRF synthetic dataset (objects) and the tandt db dataset (scenes), StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.
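Insight (1) can be illustrated with a small sketch in which a gradient step touches only the color of each Gaussian and leaves every geometric attribute untouched. The `Gaussian` dataclass, the `stylize_step` helper, the plain RGB triple (actual 3D GS implementations typically store spherical-harmonic color coefficients), and the placeholder gradients are all assumptions made for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    position: List[float]  # geometry: frozen during stylization
    scale: List[float]     # geometry: frozen during stylization
    opacity: float         # treated as frozen in this sketch (assumption)
    rgb: List[float]       # color: the only attribute that is updated


def stylize_step(gaussians: List[Gaussian], rgb_grads: List[List[float]], lr: float = 0.1) -> None:
    """Apply one gradient-descent step to RGB only, clamping each channel to [0, 1]."""
    for g, grad in zip(gaussians, rgb_grads):
        g.rgb = [min(1.0, max(0.0, c - lr * dc)) for c, dc in zip(g.rgb, grad)]


scene = [Gaussian([0.0, 1.0, 2.0], [0.1, 0.1, 0.1], 0.9, [0.5, 0.5, 0.5])]
grads = [[1.0, -1.0, 0.0]]  # placeholder gradients of a style loss w.r.t. RGB
stylize_step(scene, grads)
print(scene[0].rgb, scene[0].position)
```

Because position, scale, and opacity are never written to, the reconstructed geometry is identical before and after the step; only the appearance changes.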

Paper Structure

This paper contains 33 sections, 28 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Overview of our 3D stylization framework (StyleMe3D). (a) Style Purification: extracts and refines style representations via Style Cleaning in CLIP space, removing content interference from reference images. (b) Multi-Expert Stylization: the Dynamic Style Score Distillation (DSSD) module employs dynamic noise scheduling and adaptive style guidance, integrating latent losses to achieve consistent stylization step by step. It integrates three specialized components within the DSSD framework: Simultaneously Optimized Scale (SOS), adaptive noise scheduling for texture preservation; Contrastive Style Descriptor (CSD), which separates style and content via contrastive learning to produce a style similarity score; and CLIP-IQA, quality-guided refinement using antonymic semantic prompts. (c) Progressive Consistency Optimization (Style Outpainting): progressive outpainting achieves multi-view style propagation and ensures coherence through iterative latent alignment, eliminating multi-view dependencies.
  • Figure 2: Visual Result. Demonstration of our method's performance across five styles (vangogh wheat field, star night, fire nezha, colorful oil, and lighting tiger) applied to five objects (chair, ship, hotdog, lego, and mic) and two scenes (man face and train). The results illustrate our model's capability to handle two main categories of styles: (1) Non-photorealistic Art Styles (e.g., cartoon, drawing), showcasing traditional artistic expressions, and (2) State-based Styles (e.g., fire, oil), which capture physical properties. This figure demonstrates our method's versatility and semantic-aware ability in stylizing 3D models while preserving style fidelity and geometric consistency across diverse artistic and physical characteristics. For example, the semantic separation of the chair legs from the seat cushion, the detailed texture of the chair, the texture of the fire on the hot dog, and the metallic sheen on the mic are all effectively preserved.
  • Figure 3: Qualitative Comparisons on Object Level Stylization. We compare our method against other SOTA methods (SGSST \cite{galerne2024sgsst}, StyleGaussian \cite{liu2024stylegaussian}, and ARF \cite{zhang2022arf}) on the NeRF synthetic dataset (selected chair, hotdog, and mic) using the vangogh wheat field, fire nezha, and sketch styles. The horizontal axis represents the compared methods, and the vertical axis displays different data. Our method effectively retains the semantics and details of the original model and the style features of the reference image, such as the semantic separation of the chair legs from the seat cushion, the texture of the fire on the hot dog, and the metallic sheen on the mic. Compared to the others, our method exhibits stronger semantic understanding, clearly distinguishing elements like the cushions, backrest, and legs of the chair.
  • Figure 4: Qualitative Comparisons on Scene Level Stylization. We compare our method against other SOTA methods (SGSST \cite{galerne2024sgsst}, StyleGaussian \cite{liu2024stylegaussian}, and ARF \cite{zhang2022arf}) on the tandt db dataset (selected truck and train) using the landscope and lighting tiger styles. The horizontal axis represents the compared methods, and the vertical axis displays different data. Our method effectively retains the semantics and details of the original model and the style features of the reference image, such as the truck wheel and train fence (as shown in the zoom-in). Compared to the others, our method exhibits stronger semantic understanding, clearly distinguishing elements like the fence, tire, and rail.
  • Figure 5: Ablation study on style outpainting guidance mode. (a) Baseline without style outpainting exhibits limited stylization scope and view-dependent artifacts (red boxes). (b) Local Guidance enables single-view enhancement but causes multi-view inconsistencies. (c) Global-Local Fusion achieves cross-view style propagation through adaptive attention weighting, improving style consistency while preserving view-specific details.
  • ...and 6 more figures