VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames

Zhiqi Li, Chengrui Dong, Yiming Chen, Zhangchi Huang, Peidong Liu

TL;DR

VicaSplat tackles the challenge of reconstructing a 3D scene as pixel-aligned 3D Gaussian splats while simultaneously estimating per-frame camera poses from unposed video frames. It introduces a transformer-based encoder–decoder with learnable camera tokens, bidirectional video-camera attention, and framewise modulation to fuse multi-view information in a single forward pass. A dual-quaternion pose representation and a novel alignment loss improve pose estimation, while progressive multi-view training and 3D priors distilled from pretrained models enable robust generalization, including cross-dataset performance on ScanNet without fine-tuning. The approach achieves competitive novel view synthesis with multiple views, outperforms baselines in multi-view scenarios, and delivers strong camera pose estimates, offering a practical, pose-free solution for real-world 3D reconstruction from unposed sequences.
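
The summary above names a dual-quaternion pose representation without spelling it out. As a rough illustration only, the sketch below uses the standard textbook encoding of a rigid transform: a unit rotation quaternion as the real part and half the product of the translation with that rotation as the dual part. Whether VicaSplat uses this exact convention (component order, normalization, sign handling) is an assumption, and the function names `pose_to_dual_quat` and `dual_quat_to_pose` are hypothetical helpers, not taken from the paper or its code.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_conj(q):
    """Conjugate of a quaternion (inverse for unit quaternions)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def pose_to_dual_quat(q_r, t):
    """Encode a rotation (unit quaternion q_r) and translation t as (real, dual).

    Assumed convention: dual = 0.5 * (0, t) * q_r.
    """
    t_quat = (0.0, t[0], t[1], t[2])
    q_d = tuple(0.5 * c for c in quat_mul(t_quat, q_r))
    return q_r, q_d

def dual_quat_to_pose(q_r, q_d):
    """Recover (rotation, translation) via t = 2 * dual * conj(real)."""
    t_quat = quat_mul(tuple(2.0 * c for c in q_d), quat_conj(q_r))
    return q_r, t_quat[1:]

# Example: 90-degree rotation about the z axis plus a translation.
q_r = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
real, dual = pose_to_dual_quat(q_r, (1.0, 2.0, 3.0))
_, t_back = dual_quat_to_pose(real, dual)
```

Under this convention the real and dual parts are orthogonal as 4-vectors, which is the unit dual-quaternion constraint. A network that regresses all eight numbers freely would need some way to respect or re-impose that constraint, and the summary does not say how the paper handles it.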

Abstract

We present VicaSplat, a novel framework for joint 3D Gaussian reconstruction and camera pose estimation from a sequence of unposed video frames, which is a critical yet underexplored task in real-world 3D applications. The core of our method lies in a novel transformer-based network architecture. In particular, our model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional learnable camera tokens. The resulting tokens then fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from the visual tokens of different views, and further modulate them frame by frame to inject view-dependent features. 3D Gaussian splats and camera pose parameters can then be estimated via different prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs, and achieves comparable performance to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization capability on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: https://lizhiqi49.github.io/VicaSplat.

Paper Structure

This paper contains 33 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the proposed method. The model employs a transformer encoder to convert video frames into visual tokens, while a custom transformer decoder with learnable camera tokens processes these representations. Two dedicated prediction heads then predict camera poses and 3D Gaussians respectively.
  • Figure 2: One block of VicaSplat decoder.
  • Figure 3: Predicted 3D Gaussian splats and camera poses (top row) as well as rendered novel view RGBs and depths (bottom row). Our model can jointly reconstruct 3D Gaussians and recover camera extrinsic parameters through a single forward pass. High-fidelity RGB images and depth maps can be rendered from novel views.
  • Figure 4: Qualitative comparison of novel view synthesis on the RealEstate10k test set with 8 input images. Our model produces the best rendering details and geometric accuracy compared to all baseline methods.
  • Figure 5: Ablations on novel view synthesis. Without our proposed framewise modulation and cross-neighbor attention layer, there is an obvious degradation in the geometry prediction, and the rendered images show severe blur.