M2StyleGS: Multi-Modality 3D Style Transfer with Gaussian Splatting

Xingyu Miao, Xueqi Qiu, Haoran Duan, Yawen Huang, Xian Wu, Jingjing Deng, Yang Long

Abstract

Conventional 3D style transfer methods rely on a fixed reference image to apply artistic patterns to 3D scenes. However, in practical applications such as virtual or augmented reality, users often prefer more flexible inputs, including textual descriptions and diverse imagery. In this work, we introduce M2StyleGS, a novel real-time styling technique that generates a sequence of precisely color-mapped views. It uses 3D Gaussian Splatting (3DGS) as the 3D representation and multi-modality knowledge refined by CLIP as the reference style. M2StyleGS resolves the abnormal-transformation issue through a precise feature alignment, termed subdivisive flow, which strengthens the projection of the mapped CLIP text-visual combination feature onto the VGG style feature. In addition, we introduce an observation loss, which helps the stylized scene better match the reference style during generation, and a suppression loss, which suppresses drift of the reference color information throughout the decoding process. By integrating these components, M2StyleGS can use either text or images as references to generate a set of style-enhanced novel views. Our experiments show that M2StyleGS achieves better visual quality and surpasses prior work by up to 32.92% in terms of consistency.
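
To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of how a subdivisive flow and the two auxiliary losses might look. The flow is written as a small velocity network integrated over several Euler steps, transporting a CLIP feature along a trajectory into VGG feature space; the observation loss matches AdaIN-style channel statistics against the reference style, and the suppression loss penalizes per-channel color drift in the decoder output. All module names, dimensions, step counts, and loss formulations here are our own illustrative assumptions, not the paper's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM = 512   # assumed CLIP embedding size
VGG_DIM = 512    # assumed VGG feature channel size (e.g. a relu3_1 layer)

class SubdivisiveFlow(nn.Module):
    """Integrates a learned velocity field over `rounds` Euler steps,
    moving a CLIP feature along a trajectory toward VGG feature space."""
    def __init__(self, dim=CLIP_DIM, rounds=4):
        super().__init__()
        self.rounds = rounds
        # velocity network v(x, t): current feature plus a scalar time input
        self.velocity = nn.Sequential(
            nn.Linear(dim + 1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.proj = nn.Linear(dim, VGG_DIM)  # final map into VGG space

    def forward(self, clip_feat):
        x = clip_feat
        dt = 1.0 / self.rounds
        for r in range(self.rounds):
            t = torch.full_like(x[:, :1], r * dt)  # per-sample time scalar
            x = x + dt * self.velocity(torch.cat([x, t], dim=-1))
        return self.proj(x)

def observation_loss(stylized_feat, style_feat):
    # Assumed form: pull rendered-view feature statistics toward the
    # reference style via channel-wise mean/std (AdaIN-style) matching.
    def stats(f):
        return f.mean(dim=(-2, -1)), f.std(dim=(-2, -1))
    mu_s, sd_s = stats(stylized_feat)
    mu_r, sd_r = stats(style_feat)
    return F.mse_loss(mu_s, mu_r) + F.mse_loss(sd_s, sd_r)

def suppression_loss(decoded_rgb, reference_rgb):
    # Assumed form: penalize drift of the reference color information
    # introduced by the decoder, via per-channel mean color.
    return F.l1_loss(decoded_rgb.mean(dim=(-2, -1)),
                     reference_rgb.mean(dim=(-2, -1)))

In this sketch, the flow output would stand in for a VGG style feature whenever only a text prompt is available, letting text drive the same stylization path that a reference image would.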

Paper Structure

This paper contains 17 sections, 13 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Multi-modality 3D style transfer with M2StyleGS. Employing a set of 3D scene images captured from various perspectives, M2StyleGS can effectively apply reference styles described in arbitrary images or text.
  • Figure 2: The pipeline of M2StyleGS. During the training phase, M2StyleGS performs style transfer using style images. After training, M2StyleGS can apply multi-modality stylistic transformations directly using either images or text (the purple dashed arrows illustrate text feature processing).
  • Figure 3: Comparison of the previous alignment method with the proposed subdivisive flow. (a) illustrates that the direct use of the mapping module can result in mismatches in feature projection. (b) illustrates that the subdivisive flow precisely learns the ODE that characterizes the trajectory from the CLIP feature space to the VGG feature space.
  • Figure 4: Qualitative comparison with SOTA 3D style transfer methods. M2StyleGS can achieve precise style transfer based on the reference style images or reference style text.
  • Figure 5: Ablation studies on subdivisive flow. The top row shows the r-th subdivisive-round outcomes and their feature distributions for image style transfer. The bottom row shows the corresponding text style transfer; i.e., the original text feature distribution is the same as the style image feature distribution, hence the initial SIM and FID are the same.
  • ...and 1 more figure