ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

Abhishek Saroha, Florian Hofherr, Mariia Gladkova, Cecilia Curreli, Or Litany, Daniel Cremers

TL;DR

ZDySS tackles zero-shot stylization for dynamic scenes by extending 3D Gaussian splatting with per-Gaussian semantic features aligned to 2D VGG features, enabling AdaIN-based style transfer in 3D space. It applies Adaptive Instance Normalization in the learned feature space, while a running average of rendered features enforces spatio-temporal consistency, allowing unseen styles at inference without per-style optimization; the operation relies on $AdaIN(F_c, F_s)= \sigma(F_s)\left(\frac{F_c-\mu(F_c)}{\sigma(F_c)}\right)+\mu(F_s)$ and on statistics $(\mu_{avg}, \sigma_{ma})$. The method is trained end-to-end with feature-space supervision from a pretrained encoder and a reconstruction term, yielding competitive results against baselines on real-world dynamic scenes. This work advances practical scene editing for dynamic environments, with potential applications in games, film, and AR/VR.
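The AdaIN operation named in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `adain`, the `(N, C)` array layout, the `eps` stabilizer, and the random stand-in features are all assumptions made for illustration. Statistics are taken per channel over the first axis.

```python
import numpy as np

def adain(content_feats, style_feats, eps=1e-5):
    """Channel-wise AdaIN on (N, C) feature arrays (illustrative sketch).

    Implements sigma(F_s) * (F_c - mu(F_c)) / sigma(F_c) + mu(F_s),
    with a small eps added to the denominator for numerical stability
    (the eps term is an assumption, not part of the stated formula).
    """
    mu_c = content_feats.mean(axis=0, keepdims=True)
    sigma_c = content_feats.std(axis=0, keepdims=True)
    mu_s = style_feats.mean(axis=0, keepdims=True)
    sigma_s = style_feats.std(axis=0, keepdims=True)
    normalized = (content_feats - mu_c) / (sigma_c + eps)
    return sigma_s * normalized + mu_s

# Random stand-ins for content and style features (hypothetical data).
rng = np.random.default_rng(0)
content = rng.normal(2.0, 3.0, size=(1000, 8))
style = rng.normal(-1.0, 0.5, size=(500, 8))
out = adain(content, style)
```

After the call, the per-channel mean and standard deviation of `out` match those of `style` up to the `eps` term, which is the property style transfer by AdaIN relies on.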

Abstract

Stylizing a dynamic scene based on an exemplar image is critical for various real-world applications, including gaming, filmmaking, and augmented and virtual reality. However, achieving consistent stylization across both spatial and temporal dimensions remains a significant challenge. Most existing methods are designed for static scenes and often require an optimization process for each style image, limiting their adaptability. We introduce ZDySS, a zero-shot stylization framework for dynamic scenes, allowing our model to generalize to previously unseen style images at inference. Our approach employs Gaussian splatting for scene representation, linking each Gaussian to a learned feature vector that renders a feature map for any given view and timestamp. By applying style transfer on the learned feature vectors instead of the rendered feature map, we enhance spatio-temporal consistency across frames. Our method demonstrates superior performance and coherence over state-of-the-art baselines in tests on real-world dynamic scenes, making it a robust solution for practical applications.

Paper Structure (26 sections, 9 equations, 4 figures, 3 tables)

This paper contains 26 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Method Overview. This figure provides an overview of our method, ZDySS. During training, we follow a straightforward pipeline similar to zhou2024feature3dgs and labe2024dgd. In addition, we compute the moving-average mean and standard deviation of the rendered feature map; at inference time, these are used to normalize the learnt semantic feature vector $f_i$ of each 3D Gaussian before it is scaled and shifted by the feature statistics of the style image $S_i$. We then render the stylized feature map for a given view and timestep and decode it to obtain the stylized novel view.
  • Figure 2: Qualitative Results. We compare ZDySS against the baselines S-DyRF, Ada-4DGS, and 4DGS-Ada. Despite not being optimized for each queried style image, ZDySS faithfully stylizes the given scene at various timesteps and viewpoints. ZDySS also retains the most detail of all the methods while carrying the style information. For instance, Ada-4DGS and 4DGS-Ada suffer from spiky, elongated Gaussians and strong blurriness, especially in high-frequency regions. S-DyRF, on the other hand, is blurrier than our method. We also provide videos in the supplementary material that show the consistency differences between our method and the baselines.
  • Figure 3: Style Interpolation. We interpolate between the latent vectors of two different style images at test time, obtaining meaningful stylizations as we move from one style to another.
  • Figure 4: Pretraining. Pretraining the scene without feature-map supervision helps retain finer details in the stylized outputs. We pretrain the scene for 14000 iterations, as suggested in wu20244dgs.
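
The inference path described in the Figure 1 caption (moving-average statistics of rendered feature maps, used to normalize per-Gaussian features before the style statistics are applied) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the class `RunningFeatureStats`, the exponential form of the moving average and its `momentum` value, the function names, and the random stand-in data are all hypothetical. The `interpolate_style` helper is one plausible reading of the style interpolation in Figure 3, here blending style statistics linearly; the paper's exact latent representation is not given in this summary.

```python
import numpy as np

class RunningFeatureStats:
    """Moving average of per-channel mean/std of rendered feature maps.

    Hypothetical helper; an exponential moving average is assumed here.
    """

    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.mean = None
        self.std = None

    def update(self, feature_map):
        # feature_map has shape (H, W, C); statistics are taken per channel.
        flat = feature_map.reshape(-1, feature_map.shape[-1])
        mu, sigma = flat.mean(axis=0), flat.std(axis=0)
        if self.mean is None:
            self.mean, self.std = mu, sigma
        else:
            m = self.momentum
            self.mean = m * self.mean + (1.0 - m) * mu
            self.std = m * self.std + (1.0 - m) * sigma

def stylize_gaussian_features(gaussian_feats, stats, style_mean, style_std, eps=1e-5):
    """Normalize (N, C) per-Gaussian features with the running statistics,
    then scale and shift them by the style image's feature statistics."""
    normalized = (gaussian_feats - stats.mean) / (stats.std + eps)
    return style_std * normalized + style_mean

def interpolate_style(style_a, style_b, t):
    """Linearly blend two (mean, std) style statistics, with t in [0, 1]."""
    mean = (1.0 - t) * style_a[0] + t * style_b[0]
    std = (1.0 - t) * style_a[1] + t * style_b[1]
    return mean, std

# Stand-in data: random "rendered" feature maps seen during training.
rng = np.random.default_rng(0)
stats = RunningFeatureStats(momentum=0.9)
for _ in range(50):
    stats.update(rng.normal(1.0, 2.0, size=(16, 16, 4)))

gaussian_feats = rng.normal(1.0, 2.0, size=(200, 4))
style_a = (np.full(4, -0.5), np.full(4, 0.25))
style_b = (np.full(4, 1.5), np.full(4, 0.75))
style_mean, style_std = interpolate_style(style_a, style_b, 0.5)
stylized = stylize_gaussian_features(gaussian_feats, stats, style_mean, style_std)
```

Because the normalization uses statistics shared across all views and timestamps rather than per-frame statistics, every frame is stylized with the same transformation, which is the intuition behind the consistency claim in the caption.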