GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces

Melis Ocal, Xiaoyan Xing, Yue Li, Ngo Anh Vien, Sezer Karaoglu, Theo Gevers

TL;DR

GaussianBlender presents a diffusion-based, feed-forward editor for 3D Gaussian splats that eliminates per-asset optimization by learning latent priors in a disentangled geometry-appearance space. The method groups Gaussians spatially, encodes them into dual latents, and uses a latent diffusion model conditioned on text to apply edits, with a final stage that maps source to edited latents while preserving geometry. Across quantitative metrics and user studies, it achieves geometry-preserving, multi-view-consistent stylization with near real-time inference and generalizes to out-of-domain assets, enabling scalable 3D stylization for production. The approach demonstrates strong advantages over prior optimization-based methods and related feed-forward editors by offering controlled editing and robust 3D consistency in large-scale workflows.

Abstract

3D stylization is central to game development, virtual reality, and digital arts, where the demand for diverse assets calls for scalable methods that support fast, high-fidelity manipulation. Existing text-to-3D stylization methods typically distill from 2D image editors, requiring time-intensive per-asset optimization and exhibiting multi-view inconsistency due to the limitations of current text-to-image models, which makes them impractical for large-scale production. In this paper, we introduce GaussianBlender, a pioneering feed-forward framework for text-driven 3D stylization that performs edits instantly at inference. Our method learns structured, disentangled latent spaces with controlled information sharing for geometry and appearance from spatially-grouped 3D Gaussians. A latent diffusion model then applies text-conditioned edits on these learned representations. Comprehensive evaluations show that GaussianBlender not only delivers instant, high-fidelity, geometry-preserving, multi-view consistent stylization, but also surpasses methods that require per-instance test-time optimization - unlocking practical, democratized 3D stylization at scale.

Paper Structure

This paper contains 27 sections, 11 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Given a 3D Gaussian splat asset and an edit prompt, GaussianBlender - a diffusion-based feed-forward style editor - generates modified assets instantly, fully eliminating per-asset test-time optimization. GaussianBlender delivers high-fidelity, geometry-preserving, multi-view consistent stylizations and supports interactive appearance editing - unlocking practical, democratized 3D stylization at scale.
  • Figure 2: Overview of our method. (1) Latent space learning: Given input Gaussians, our method first groups them based on spatial proximity and encodes them into group-structured, disentangled latent spaces with controlled cross-branch feature sharing. (2) Latent diffusion pre-training: A denoiser $d_{\varrho}$ then learns to denoise the noisy appearance latent ${\mathbf{z}_{c}^{s}}_\mathbf{T}$ conditioned on embedding $\mathcal{C}$. (3) Latent editing: Once 3D priors are captured, $d_{\varrho}$ is further trained to learn an editing function $g(\cdot)$ that maps the latent $\mathbf{z}_{c}^{s}$ to a modified latent $\mathbf{z}_{c}^{e}$ based on embedding $\mathcal{C}^{e}$, guided by the geometry latent $\mathbf{z}_{g}^{s}$. At inference, GaussianBlender generates modified, high-quality, 3D-consistent assets from text prompts in a single feed-forward pass, fully eliminating test-time optimization. Trainable models at each stage are denoted.
  • Figure 3: Qualitative comparison with state-of-the-art methods. Unlike baselines that yield over-saturated, dramatic edits that alter 3D structure (e.g., blurred boundaries; “Make it pop-art neon duotone”), minimal edits that are barely perceptible (“Make its colors look like a rainbow”), or severe geometric distortions, GaussianBlender delivers high-fidelity, text-aligned 3D stylizations with strong geometry preservation instantly, with a single feed-forward pass.
  • Figure 4: Cross-dataset generalization on OmniObject3D. Our framework demonstrates strong style editing performance on out-of-distribution 3D assets.
  • Figure 5: Visual assessment of Gaussian VAE reconstructions. The proposed dual-branch Gaussian VAE yields sharper reconstructions with better geometric fidelity and color accuracy.
  • ...and 6 more figures
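
Step (1) of Figure 2 groups Gaussians by spatial proximity before encoding them. The captions do not say which grouping procedure is used, so the following is only a minimal sketch under stated assumptions: it picks well-spread group centers with farthest point sampling and assigns each Gaussian center to its nearest one. The function names `farthest_point_centers` and `group_by_proximity`, the `num_groups` parameter, and the choice of farthest point sampling are all hypothetical and are not taken from the paper.

```python
import math
import random


def farthest_point_centers(points, num_groups, seed=0):
    """Pick `num_groups` well-spread center indices via farthest point sampling.

    Assumption: `points` holds the 3D means of the Gaussians, and
    num_groups <= len(points). This is an illustrative choice, not the
    paper's stated procedure.
    """
    rng = random.Random(seed)
    centers = [rng.randrange(len(points))]
    # Distance from every point to its nearest already-chosen center.
    nearest = [math.dist(p, points[centers[0]]) for p in points]
    while len(centers) < num_groups:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return centers


def group_by_proximity(points, num_groups, seed=0):
    """Return a list of groups, each a list of point indices."""
    centers = farthest_point_centers(points, num_groups, seed)
    groups = [[] for _ in centers]
    for i, p in enumerate(points):
        # Assign the point to the closest sampled center.
        g = min(range(len(centers)),
                key=lambda k: math.dist(p, points[centers[k]]))
        groups[g].append(i)
    return groups


# Toy input: two well-separated clusters of Gaussian means.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0),
          (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5)]
groups = group_by_proximity(points, num_groups=2)
```

Each resulting group would then be passed to the geometry and appearance encoder branches; how those branches share features is described in the paper itself and is not modeled here.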