Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, Yuliang Xiu

TL;DR

Visual foundation models (VFMs) are trained primarily on 2D images, and existing 3D probes do not measure their texture awareness. Feat2GS reads out 3D Gaussian Splatting attributes from frozen VFM features and uses novel view synthesis (NVS) as a dense proxy task for 3D awareness, which avoids the need for ground-truth 3D labels. The study finds that current VFMs generally capture geometry well but have limited texture awareness; texture-preserving pretraining (e.g., MAE) and the use of 3D data help, and simple feature ensembling brings further gains. Feat2GS therefore serves both as a practical probing tool and as a competitive NVS baseline, and its findings can guide the development of 3D-aware VFMs.

Abstract

Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? Because VFMs differ in architecture and training protocols (i.e., objectives, proxy tasks), a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing work on 3D probing uses single-view 2.5D estimation (e.g., depth and normal) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness and require 3D data as ground truth, which limits the scale and diversity of their evaluation sets. To address these issues, we introduce Feat2GS, which reads out 3D Gaussian attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness for geometry and texture via novel view synthesis, without requiring 3D data. Additionally, the disentanglement of 3DGS parameters into geometry ($\boldsymbol{x}, \alpha, \Sigma$) and texture ($\boldsymbol{c}$) enables separate analysis of texture and geometry awareness. Under Feat2GS, we conduct extensive experiments to probe the 3D awareness of several VFMs, and investigate the ingredients that lead to a 3D-aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art performance across diverse datasets. This makes Feat2GS useful for probing VFMs, and as a simple-yet-effective baseline for novel view synthesis. Code and data will be made available at https://fanegg.github.io/Feat2GS/.
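
The geometry/texture split described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the attribute sizes, the linear readout with random weights, the mode names used as dictionary keys, and the choice to fill non-probed attributes with free per-Gaussian parameters are all assumptions made for illustration. The sketch only shows which 3DGS attributes would be read out from per-pixel features in each probing mode.

```python
import random

# Assumed per-Gaussian attribute sizes (hypothetical parameterization):
# x = 3D position, alpha = opacity, sigma = covariance stored as
# 3 scales + 4 quaternion values, c = RGB color.
ATTR_DIMS = {"x": 3, "alpha": 1, "sigma": 7, "c": 3}
GEOMETRY = ("x", "alpha", "sigma")
TEXTURE = ("c",)

# Which attributes are read out from VFM features in each probing mode.
MODE_TO_READOUT = {
    "geometry": GEOMETRY,
    "texture": TEXTURE,
    "all": GEOMETRY + TEXTURE,
}


def make_linear(in_dim, out_dim, rng):
    """Random weights standing in for a trained readout layer (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, feature):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]


def build_gaussians(features, mode, rng):
    """Return one attribute dict per feature vector, plus the probed attribute names."""
    feat_dim = len(features[0])
    heads = {
        name: make_linear(feat_dim, ATTR_DIMS[name], rng)
        for name in MODE_TO_READOUT[mode]
    }
    gaussians = []
    for feature in features:
        gaussian = {}
        for name, dim in ATTR_DIMS.items():
            if name in heads:
                # Probed attribute: predicted from the VFM feature.
                gaussian[name] = apply_linear(heads[name], feature)
            else:
                # Non-probed attribute: a free per-Gaussian parameter (assumption).
                gaussian[name] = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        gaussians.append(gaussian)
    return gaussians, sorted(heads)


rng = random.Random(0)
toy_features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # two toy "pixels"
gaussians, probed = build_gaussians(toy_features, "geometry", rng)
print(probed)  # attribute names that depend on the features in this mode
```

In this sketch, switching `mode` changes only which attributes depend on the features, which is what lets rendering quality be attributed to geometry or to texture.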

Paper Structure

This paper contains 15 sections, 6 equations, 16 figures, 14 tables.

Figures (16)

  • Figure 1: Texture+Geometry probing of mainstream VFMs. Normalized average metrics for novel view synthesis (NVS) across six datasets are plotted on axes, with higher values away from the center indicating better performance. Try the interactive visualization demo at https://fanegg.github.io/Feat2GS/#chart.
  • Figure 2: Qualitative Examples. We compare novel view renderings across VFM features. In Geometry mode, the multi-teacher-distillation method (RADIO) and point-regression-based methods (MASt3R, DUSt3R) produce more plausible geometry, e.g., the vehicle front and the wheel, indicating better multi-view consistency. All VFM features struggle in Texture mode, and renderings in the All mode are notably blurred, both reflecting the limited texture awareness of current VFMs.
  • Figure 3: Novel View Synthesis as Proxy Task to Assess 3D. We present qualitative examples from the DTU dataset, including NVS, Point cloud (readout 3DGS positions), Accuracy (smallest distance from a readout point to the ground truth), Completeness (smallest distance from a ground-truth point to a readout point), and Distance (based on ground-truth point matching). The results show that NVS quality aligns with the 3D metrics, which supports its reliability as an indicator for 3D assessment. RADIO performs best and SD worst, with IUVRGB as a reference. Zoom in or see https://fanegg.github.io/Feat2GS/#dtu for more details.
  • Figure 4: GTA Modes Comparison for the Same Region. We present novel view synthesis of GTA modes using RADIO features. Texture mode shows broken structures as it excludes VFM features for 3DGS geometry regression, while All mode is blurrier than Geometry mode due to reliance on VFM features for color regression. This highlights that the blurriness in the All mode arises from the lack of texture awareness in VFMs.
  • Figure 5: Performance Correlations of GTA across All Datasets. The All mode correlates strongly with the Geometry mode in PSNR and SSIM (which primarily reflect structural consistency) and with the Texture mode in LPIPS (commonly used to assess image sharpness), suggesting that an optimal All mode depends on both a high-performing Geometry mode and a high-performing Texture mode.
  • ...and 11 more figures
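
The Accuracy and Completeness definitions in the Figure 3 caption can be sketched as nearest-neighbor distances between two point sets. This is a minimal brute-force illustration, not the evaluation code used in the paper: averaging the per-point distances is an assumption, and the function names and toy point sets are hypothetical.

```python
import math


def nearest_distance(point, cloud):
    """Smallest Euclidean distance from `point` to any point in `cloud`."""
    return min(math.dist(point, other) for other in cloud)


def accuracy(readout_points, gt_points):
    """Mean smallest distance from each readout point to the ground truth."""
    total = sum(nearest_distance(p, gt_points) for p in readout_points)
    return total / len(readout_points)


def completeness(readout_points, gt_points):
    """Mean smallest distance from each ground-truth point to the readout points."""
    total = sum(nearest_distance(g, readout_points) for g in gt_points)
    return total / len(gt_points)


# Toy example: the readout covers two of the three ground-truth points exactly.
readout = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(accuracy(readout, ground_truth))      # 0.0: every readout point lies on the ground truth
print(completeness(readout, ground_truth))  # > 0: one ground-truth point is not covered
```

The two metrics are asymmetric by design: a sparse but precise readout scores well on Accuracy and poorly on Completeness, so both are needed to judge the readout positions. The brute-force search is quadratic in the number of points; a spatial index would be needed for real point clouds.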