CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Wonseok Roh; Hwanhee Jung; Jong Wook Kim; Seunggwan Lee; Innfarn Yoo; Andreas Lugmayr; Seunggeun Chi; Karthik Ramani; Sangpil Kim

TL;DR

CATSplat addresses the challenge of reconstructing 3D scenes from a single view by introducing two guiding priors into a transformer-based 3D Gaussian Splatting framework. Textual context from a visual-language model and spatial cues from backprojected 3D point features are integrated via cross-attention to enrich image features and enable robust 3D Gaussian predictions. The method achieves state-of-the-art performance for single-view novel-view synthesis on RealEstate10K and demonstrates strong cross-dataset generalization, including indoor and outdoor scenes. This approach advances monocular 3D reconstruction by leveraging multimodal knowledge and 3D priors, enabling high-quality rendering from previously unseen viewpoints with practical implications for AR/VR and robotics.
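The "backprojected 3D point" step mentioned above can be illustrated with a standard pinhole camera model. This is a minimal sketch, not the authors' implementation: the function name `backproject`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are all assumptions made for illustration, and the summary does not say how depth is obtained or how point features are computed from the resulting points.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-space 3D points.

    Assumes a pinhole camera: pixel (u, v) with depth d maps to
    ((u - cx) / fx * d, (v - cy) / fy * d, d).
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            x = (u - cx) / fx * d
            y = (v - cy) / fy * d
            points.append((x, y, d))
    return points


# Toy 2x2 depth map with hypothetical intrinsics.
depth = [[2.0, 2.0],
         [4.0, 4.0]]
pts = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each pixel yields one 3D point, which matches the per-pixel parameterization described in the abstract; a point encoder would then turn these coordinates into features.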

Abstract

Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under single-view settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
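The cross-attention mechanism used to inject text embeddings and 3D point features into image features can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the CATSplat architecture: it uses a single head, identity projections instead of learned query/key/value weights, and a plain residual connection. The names `cross_attention`, `image_tokens`, and `context_tokens` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head cross-attention with identity projections (assumption).

    queries: image tokens, each a list of floats of length dim.
    context: guidance tokens (e.g. text embeddings or point features),
             each of the same length dim, used as both keys and values.
    Returns the queries enriched with attended context via a residual sum.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in context]
        weights = softmax(scores)
        # Weighted sum of context tokens (values equal keys here).
        attended = [sum(w * k[i] for w, k in zip(weights, context))
                    for i in range(dim)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


image_tokens = [[0.0, 0.0], [1.0, 0.0]]
context_tokens = [[1.0, 0.0], [0.0, 1.0]]
enriched = cross_attention(image_tokens, context_tokens)
```

In this sketch the same routine could be applied once with text embeddings as context and once with point features as context; how the two guidance sources are ordered or combined in the actual model is not stated in the abstract.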
