Splatent: Splatting Diffusion Latents for Novel View Synthesis

Or Hirschorn, Omer Sela, Inbar Huberman-Spiegelglas, Netalee Efrat, Eli Alshan, Ianir Ideses, Frederic Devernay, Yochai Zvik, Lior Fritz

TL;DR

Splatent addresses the challenge of multi-view inconsistency in VAE latent spaces used for diffusion-based 3D radiance fields by recovering high-frequency details in 2D space through multi-view attention. It introduces a two-stage approach: latent 3D Gaussian splatting in the VAE latent space, followed by a diffusion-based refinement that fuses rendered latents with nearby reference views, all while keeping the VAE frozen. The method achieves state-of-the-art results on dense and sparse view scenarios across DL3DV-10K, LLFF, and Mip-NeRF360, and can enhance feed-forward latent 3DGS models like MVSplat360. This work enables more faithful, detail-preserving novel-view synthesis in latent-space pipelines, with practical impact on memory-efficient 3D reconstruction and diffusion-based rendering.

Abstract

Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: the VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pretrained diffusion models to recover fine-grained details, at the risk of hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state of the art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.

Paper Structure

This paper contains 31 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Novel view synthesis from a latent-space radiance field. Splatent is a principled framework to enhance rendered novel views from a radiance field in the latent space of diffusion VAEs. We demonstrate improvements in image quality in the setting of test-time latent radiance field optimization, compared to LRF [zhou2025latent]. In addition, we show how Splatent can be connected within a latent-based feed-forward model like MVSplat360 [chen2024mvsplat360] to enhance the results and reduce hallucinations.
  • Figure 2: Framework Overview. Given a set of input views with known camera parameters, each image is encoded into the VAE latent space of a diffusion model. We then perform 3DGS optimization to reconstruct the underlying latent radiance field. Due to multi-view inconsistencies in the latent space of diffusion VAEs, a rendered novel-view latent lacks high-frequency details. We tile this rendered view together with reference views into a grid, and leverage a single-step diffusion model with a self-attention mechanism that aggregates information across all views. The enhanced latent image is finally decoded to obtain the novel-view image.
  • Figure 4: Qualitative comparison. We compare the novel-view synthesis quality of Splatent to that of other latent radiance field methods. Feature-3DGS [zhou2024feature] exhibits considerable loss of detail, and LRF [zhou2025latent] improves upon this baseline but still fails to recover fine details. In contrast, Splatent produces sharper and more faithful reconstructions. The scenes are taken from the DL3DV-10K dataset.
  • Figure 5: Feed-Forward Qualitative comparison. We demonstrate how Splatent can enhance feed-forward latent radiance field methods such as MVSplat360 [chen2024mvsplat360]. While MVSplat360 often hallucinates (e.g., the window in the first example or the tree in the last example) and lacks fine details, Splatent yields sharper and more faithful reconstructions.
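
The tiling step described in the Figure 2 caption can be illustrated with a minimal sketch. The caption only states that the rendered latent and the reference latents are tiled into a grid so that self-attention can aggregate information across views. The function names `tile_latents` and `extract_rendered`, the row-major layout, the default of two columns, and the choice to place the rendered view in the top-left cell are all assumptions made for illustration, not details confirmed by the paper. The single-step diffusion model itself is omitted.

```python
import numpy as np


def tile_latents(rendered, references, cols=2):
    """Tile a rendered latent and reference latents into one grid.

    All latents are arrays of shape (C, H, W). The rendered view goes in
    the top-left cell (an assumed convention); unused cells stay zero.
    Returns an array of shape (C, rows * H, cols * W).
    """
    views = [rendered] + list(references)
    c, h, w = rendered.shape
    rows = -(-len(views) // cols)  # ceiling division
    grid = np.zeros((c, rows * h, cols * w), dtype=rendered.dtype)
    for i, view in enumerate(views):
        r, col = divmod(i, cols)  # row-major placement
        grid[:, r * h:(r + 1) * h, col * w:(col + 1) * w] = view
    return grid


def extract_rendered(grid, h, w):
    """Crop the top-left cell, where the (enhanced) novel view would sit."""
    return grid[:, :h, :w]


# Hypothetical shapes: 4-channel latents at 8x8 resolution, 3 reference views.
rendered = np.ones((4, 8, 8), dtype=np.float32)
references = [np.full((4, 8, 8), k, dtype=np.float32) for k in (2, 3, 4)]
grid = tile_latents(rendered, references)
novel = extract_rendered(grid, 8, 8)
```

In a full pipeline under these assumptions, the grid would be passed through the diffusion model, and the cropped top-left cell of its output would then be decoded by the frozen VAE.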