CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting

Kornel Howil, Joanna Waczyńska, Piotr Borycki, Tadeusz Dziarmaga, Marcin Mazur, Przemysław Spurek

TL;DR

CLIPGaussian addresses universal multimodal style transfer for Gaussian Splatting by introducing a plug-in framework that supports text- and image-guided stylization across 2D, video, 3D, and 4D data. It employs a two-stage pipeline: train a modality-specific GS base model, then fine-tune Gaussian primitives through CLIP- and VGG-based losses to jointly edit appearance and geometry without increasing model size. The method achieves high style fidelity and temporal coherence, outperforming baselines in text-guided stylization while remaining competitive for image-guided tasks, and preserves the original Gaussian count to maintain efficiency. This work enables efficient, end-to-end cross-modal stylization suitable for AR/VR, film, and digital content creation workflows.

Abstract

Gaussian Splatting (GS) has recently emerged as an efficient representation for rendering 3D scenes from 2D images and has been extended to images, videos, and dynamic 4D content. However, applying style transfer to GS-based representations, especially beyond simple color changes, remains challenging. In this work, we introduce CLIPGaussian, the first unified style transfer framework that supports text- and image-guided stylization across multiple modalities: 2D images, videos, 3D objects, and 4D scenes. Our method operates directly on Gaussian primitives and integrates into existing GS pipelines as a plug-in module, without requiring large generative models or retraining from scratch. The CLIPGaussian approach enables joint optimization of color and geometry in 3D and 4D settings, and achieves temporal coherence in videos, while preserving the model size. We demonstrate superior style fidelity and consistency across all tasks, validating CLIPGaussian as a universal and efficient solution for multimodal style transfer.

Paper Structure

This paper contains 22 sections, 13 equations, 37 figures, 22 tables.

Figures (37)

  • Figure 1: We present CLIPGaussian, a universal model for style transfer that supports a wide range of data modalities, including images, videos, 3D objects, and 4D dynamic scenes. Style transfer in CLIPGaussian can be guided using an image or a text prompt. Our method leverages a Gaussian Splatting representation to model both color and geometric aspects of style transfer.
  • Figure 2: CLIPGaussian architecture in the case of a 4D dynamic scene. The method operates in two main stages. In the first stage, we train a Gaussian Splatting model tailored to a specific data modality. In the second stage, we take training images, randomly sampled patches, and a conditioning input (either an image or a text prompt) and compare them in the feature spaces of the VGG-19 and CLIP models. We optimize the Gaussian parameters using a composite loss function with four terms: content preservation, background preservation, local style transfer, and global style transfer. Notably, CLIPGaussian integrates with GS-based systems as a plug-in module.
  • Figure 3: Results of text-based 4D style transfer. Our model modifies both the color and geometry of Gaussian primitives.
  • Figure 4: We conduct a user study, comparing our model against baseline methods. CLIPGaussian achieves scores comparable to G-Style with image conditioning and outperforms all models when using text prompts.
  • Figure 5: Comparison of 3D style transfer methods with text conditioning. CLIPGaussian applies style transfer with more significant shape changes. Our model captures details by attending to local regions through patch-based processing.
  • ...and 32 more figures
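
The Figure 2 caption names two ingredients of the second stage: randomly sampled patches and a composite loss with four terms. The sketch below only illustrates how those pieces could fit together. The function names, the uniform sampling scheme, and the weighted-sum form with per-term weights are assumptions made for illustration; this page does not state the actual loss formulas or weights, and the individual term values would come from VGG-19 and CLIP feature comparisons that are not implemented here.

```python
import random

# The four terms named in the Figure 2 caption.
LOSS_TERMS = ("content", "background", "local_style", "global_style")


def sample_patch_boxes(height, width, patch_size, n_patches, rng):
    """Return n_patches random square crop boxes as (top, left, bottom, right).

    Hypothetical helper: uniform sampling is an assumption, not the paper's
    documented patch strategy.
    """
    if patch_size > height or patch_size > width:
        raise ValueError("patch does not fit inside the image")
    boxes = []
    for _ in range(n_patches):
        # randint is inclusive on both ends, so the box always fits.
        top = rng.randint(0, height - patch_size)
        left = rng.randint(0, width - patch_size)
        boxes.append((top, left, top + patch_size, left + patch_size))
    return boxes


def composite_loss(terms, weights):
    """Weighted sum of the four named terms (weights are hypothetical)."""
    missing = [n for n in LOSS_TERMS if n not in terms or n not in weights]
    if missing:
        raise KeyError(f"missing loss terms or weights: {missing}")
    return sum(weights[name] * terms[name] for name in LOSS_TERMS)
```

In a real fine-tuning loop, the patch boxes would be used to crop rendered views before computing the local style term, and the scalar returned by `composite_loss` would be backpropagated to the Gaussian parameters. Because only existing primitives are updated, the Gaussian count stays fixed, which is the size-preservation property the abstract describes.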