CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Wonseok Roh, Hwanhee Jung, Jong Wook Kim, Seunggwan Lee, Innfarn Yoo, Andreas Lugmayr, Seunggeun Chi, Karthik Ramani, Sangpil Kim

TL;DR

CATSplat addresses the challenge of reconstructing 3D scenes from a single view by introducing two guiding priors into a transformer-based 3D Gaussian Splatting framework. Textual context from a visual-language model and spatial cues from backprojected 3D point features are integrated via cross-attention to enrich image features and enable robust 3D Gaussian predictions. The method achieves state-of-the-art performance for single-view novel-view synthesis on RealEstate10K and demonstrates strong cross-dataset generalization, including indoor and outdoor scenes. This approach advances monocular 3D reconstruction by leveraging multimodal knowledge and 3D priors, enabling high-quality rendering from previously unseen viewpoints with practical implications for AR/VR and robotics.

Abstract

Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
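The 3D point features mentioned above start from points lifted out of the image plane. As a rough illustration only, the sketch below backprojects a per-pixel depth map into camera-space 3D points under a pinhole camera model. The depth map (for example from a monocular depth estimator), the intrinsics `fx`, `fy`, `cx`, `cy`, and the function name `backproject` are assumptions made for this example and are not specified in the abstract; the point encoder that turns such points into features is not shown.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space 3D points, one per pixel.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with hypothetical intrinsics.
depth = [[2.0, 2.0],
         [4.0, 4.0]]
points = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
for p in points:
    print(p)
```

Each pixel yields exactly one 3D point, which matches the per-pixel nature of the Gaussian primitives described above.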

Paper Structure

This paper contains 27 sections, 7 equations, 17 figures, 10 tables, 1 algorithm.

Figures (17)

  • Figure 1: Overview of the generalizable 3D scene reconstruction pipeline. The feed-forward network creates a 3D radiance field using 3D Gaussians, all within an end-to-end differentiable system.
  • Figure 2: We introduce CATSplat, a Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from a single image. (a) Our two main priors, and (b) Examples of text descriptions (from the VLM) representing an input image.
  • Figure 3: Overview of the CATSplat framework. CATSplat takes an image $\mathcal{I}$ and predicts 3D Gaussian primitives $\{(\bm{\mu}_j, \bm{\alpha}_j, \bm{\Sigma}_j, \bm{c}_j )\}^{J}_{j}$ to construct a scene-representative 3D radiance field in a single forward pass. In this paradigm, our primary goal is to go beyond the finite knowledge inherent in a single image with our two innovative priors. Through cross-attention layers, we make the image features $F^\mathcal{I}_i$ more informative by incorporating two kinds of cues: contextual cues from text features $F^C_i$, and spatial cues from 3D point features $F^S_i$.
  • Figure 4: Detailed transformer pipeline. In the $i$-th layer, we first apply cross-attention between $F_i^{\mathcal{I}}$ and $F_i^C$, then apply cross-attention with $F_i^S$. We also use a ratio $\gamma$ to preserve visual information from $F_i^{\mathcal{I}}$ while incorporating extra cues from $F_i^C$ and $F_i^S$.
  • Figure 5: Ablation study on the effect of iteratively incorporating our novel priors, evaluated on RE10K [zhou2018stereo] ($n$=Random). For clear ablations, we keep the total number of transformer layers consistent across the experiments and adjust only the number of cross-attention (CA) layers.
  • ...and 12 more figures
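
The layer order described in the Figure 4 caption (cross-attention with the text features first, then with the 3D point features, with a ratio $\gamma$ preserving visual information) can be sketched as below. This is a minimal, dependency-free illustration and not the authors' implementation: the attention has a single head and no learned projections, and the blend rule `gamma * F_img + (1 - gamma) * attended` is an assumed form, since the caption does not give the exact formula. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: queries attend to context tokens.

    Simplification (assumption): no learned Q/K/V projections, single head.
    """
    d = len(queries[0])
    context_t = [list(col) for col in zip(*context)]
    scores = matmul(queries, context_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, context)


def blend(original, attended, gamma):
    """Assumed blend: keep a fraction gamma of the original image features."""
    return [[gamma * o + (1.0 - gamma) * a for o, a in zip(ro, ra)]
            for ro, ra in zip(original, attended)]


def guided_layer(f_img, f_ctx, f_spa, gamma):
    """One layer: contextual (text) guidance first, then spatial (3D point) guidance."""
    f_img = blend(f_img, cross_attention(f_img, f_ctx), gamma)
    f_img = blend(f_img, cross_attention(f_img, f_spa), gamma)
    return f_img


# Toy features: two image tokens, one text token, one 3D point token.
f_img = [[1.0, 0.0], [0.0, 1.0]]
f_ctx = [[2.0, 2.0]]
f_spa = [[4.0, 0.0]]
out = guided_layer(f_img, f_ctx, f_spa, gamma=0.5)
unchanged = guided_layer(f_img, f_ctx, f_spa, gamma=1.0)
print(out)
```

Under this assumed blend, `gamma=1.0` leaves the image features untouched, and smaller values let more of the textual and spatial cues through.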