ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting
Abhishek Saroha, Florian Hofherr, Mariia Gladkova, Cecilia Curreli, Or Litany, Daniel Cremers
TL;DR
ZDySS tackles zero-shot stylization of dynamic scenes by extending 3D Gaussian splatting with per-Gaussian semantic features aligned to 2D VGG features. Style transfer is performed with Adaptive Instance Normalization (AdaIN) in this learned feature space, $\mathrm{AdaIN}(F_c, F_s) = \sigma(F_s)\left(\frac{F_c-\mu(F_c)}{\sigma(F_c)}\right)+\mu(F_s)$, where $F_c$ and $F_s$ are the content and style features and $\mu$, $\sigma$ denote their mean and standard deviation. A running average of rendered features, summarized by the statistics $(\mu_{avg}, \sigma_{ma})$, enforces spatio-temporal consistency, so unseen styles can be applied at inference without per-style optimization. The method is trained end-to-end with feature-space supervision from a pretrained encoder plus a reconstruction term, and it yields competitive results against baselines on real-world dynamic scenes. Potential applications include scene editing for games, film, and AR/VR.
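To make the AdaIN formula above concrete, here is a minimal, dependency-free sketch in Python. It is an illustration and not the authors' implementation: it assumes features are stored as a list of equal-length vectors (for example, one vector per Gaussian), that statistics are taken per channel, and that a small `eps` is added for numerical stability. The names `channel_stats` and `adain` are hypothetical.

```python
import math


def channel_stats(features, eps=1e-5):
    """Return per-channel mean and standard deviation of a list of feature vectors.

    `eps` is added to the variance before the square root (an assumption made
    here for numerical stability; it is not part of the formula in the text).
    """
    n = len(features)
    dims = len(features[0])
    means = [sum(f[c] for f in features) / n for c in range(dims)]
    stds = [
        math.sqrt(sum((f[c] - means[c]) ** 2 for f in features) / n + eps)
        for c in range(dims)
    ]
    return means, stds


def adain(content, style, eps=1e-5):
    """Re-normalize content features so each channel takes the style's mean and std.

    Implements sigma(F_s) * (F_c - mu(F_c)) / sigma(F_c) + mu(F_s) per channel.
    """
    mu_c, sigma_c = channel_stats(content, eps)
    mu_s, sigma_s = channel_stats(style, eps)
    return [
        [sigma_s[c] * (f[c] - mu_c[c]) / sigma_c[c] + mu_s[c] for c in range(len(f))]
        for f in content
    ]


# Toy example: three 2-channel content features, two 2-channel style features.
content = [[0.0, 1.0], [2.0, 3.0], [4.0, 8.0]]
style = [[10.0, -1.0], [14.0, 1.0]]
stylized = adain(content, style)
print(channel_stats(stylized))  # close to the style statistics
print(channel_stats(style))
```

After the transform, each channel of the content features carries the mean and (up to `eps`) the standard deviation of the corresponding style channel. In this sketch the content statistics are computed from the features passed in; a running-average variant would supply precomputed statistics in place of `mu_c` and `sigma_c`.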
Abstract
Stylizing a dynamic scene based on an exemplar image is critical for various real-world applications, including gaming, filmmaking, and augmented and virtual reality. However, achieving consistent stylization across both spatial and temporal dimensions remains a significant challenge. Most existing methods are designed for static scenes and often require an optimization process for each style image, limiting their adaptability. We introduce ZDySS, a zero-shot stylization framework for dynamic scenes, allowing our model to generalize to previously unseen style images at inference. Our approach employs Gaussian splatting for scene representation, linking each Gaussian to a learned feature vector that renders a feature map for any given view and timestamp. By applying style transfer on the learned feature vectors instead of the rendered feature map, we enhance spatio-temporal consistency across frames. Our method demonstrates superior performance and coherence over state-of-the-art baselines in tests on real-world dynamic scenes, making it a robust solution for practical applications.
