MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models

Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, Yixuan Yuan

TL;DR

MonoSplat addresses the challenge of generalizable 3D Gaussian Splatting by leveraging frozen monocular depth foundation models to provide geometric priors. It introduces a two-part framework: a Mono-Multi Feature Adapter that transforms monocular features into cross-view representations, and an Integrated Gaussian Prediction path that fuses monocular and multi-view cues to produce accurate Gaussian primitives via a plane-sweep cost volume. The method achieves state-of-the-art or competitive results on RealEstate10K, ACID, and cross-domain DTU benchmarks, showing strong zero-shot generalization with a relatively modest training footprint thanks to a frozen backbone. These findings underscore the value of incorporating foundation-model priors for robust, real-time 3D reconstruction without per-scene optimization, enabling practical deployment in diverse real-world scenes.

Abstract

Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components: a Mono-Multi Feature Adapter that transforms monocular features into multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods while maintaining computational efficiency with minimal trainable parameters. Code is available at https://github.com/CUHK-AIM-Group/MonoSplat.
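
To make the plane-sweep cost volume mentioned in the TL;DR more concrete, here is a minimal NumPy sketch of the general technique. It is not MonoSplat's implementation, which this summary does not describe in detail. The function names (`depth_candidates`, `expected_depth`, `unproject`), the inverse-depth sampling, the dot-product correlation, and the softmax temperature are all illustrative assumptions. The sketch also assumes the source-view features have already been warped into the reference view for each depth plane.

```python
import numpy as np

def depth_candidates(near, far, num_planes):
    """Sample depth hypotheses uniformly in inverse depth (an assumed choice)."""
    inv = np.linspace(1.0 / near, 1.0 / far, num_planes)
    return 1.0 / inv

def expected_depth(ref_feat, warped_feats, depths, temperature=1.0):
    """Soft depth estimate from a correlation cost volume.

    ref_feat:     (C, H, W) reference-view features.
    warped_feats: (D, C, H, W) source features, assumed to be already warped
                  into the reference view for each of the D depth planes.
    depths:       (D,) depth value of each plane.
    """
    channels = ref_feat.shape[0]
    # Dot-product correlation over channels gives one score per plane and pixel.
    cost = (ref_feat[None] * warped_feats).sum(axis=1) / np.sqrt(channels)
    # Numerically stable softmax over the depth axis.
    logits = cost / temperature
    logits = logits - logits.max(axis=0, keepdims=True)
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=0, keepdims=True)
    # Probability-weighted average of the depth hypotheses.
    return (prob * depths[:, None, None]).sum(axis=0)

def unproject(depth, K):
    """Lift a (H, W) depth map to camera-space points (H, W, 3) using
    pinhole intrinsics K; such points could serve as Gaussian centers."""
    height, width = depth.shape
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    rays = pixels @ np.linalg.inv(K).T
    return rays * depth[..., None]
```

In a feed-forward splatting pipeline of this kind, the per-pixel depth would typically be combined with predicted opacity, covariance, and color to form one Gaussian per pixel; those heads are omitted here.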

Paper Structure

This paper contains 22 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Sketch of the proposed MonoSplat framework. We repurpose the depth foundation model into a generalizable Gaussian reconstruction framework.
  • Figure 2: Illustration of the proposed MonoSplat framework. It consists of a Mono-Multi Feature Adapter, which converts monocular features into multi-view features with cross-view awareness, and an Integrated Gaussian Prediction module, which integrates monocular features into the prediction of Gaussian primitives.
  • Figure 3: Visual comparisons of cross-dataset generalization. Using a model trained on RealEstate10K (Zhou et al., 2018), we directly render scenes from the DTU dataset (Jensen et al., 2014). MonoSplat shows superior generalization ability compared to previous methods.
  • Figure 4: Visual comparisons of novel view synthesis on RealEstate10K (Zhou et al., 2018). Models are trained on a collection of scenes from RealEstate10K and tested on novel scenes from the test split. MonoSplat shows superior rendering quality in challenging regions.
  • Figure 5: Visual comparisons of predicted 3D Gaussians (top) and depth maps (bottom). We examine the quality of geometric reconstruction by presenting comparative visualizations of 3D Gaussian primitives generated by pixelSplat, MVSplat, and our MonoSplat, alongside depth maps rendered from two reference viewpoints.
  • ...and 3 more figures