VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames

Zhiqi Li; Chengrui Dong; Yiming Chen; Zhangchi Huang; Peidong Liu

TL;DR

VicaSplat reconstructs a 3D scene as pixel-aligned 3D Gaussian splats while simultaneously estimating per-frame camera poses from unposed video frames. It introduces a transformer-based encoder–decoder with learnable camera tokens, bidirectional video-camera attention, and framewise modulation to fuse multi-view information in a single forward pass. A dual-quaternion pose representation and a novel alignment loss improve pose estimation, while progressive multi-view training and 3D priors distilled from pretrained models enable robust generalization, including cross-dataset performance on ScanNet without fine-tuning. The approach outperforms baselines in multi-view scenarios, performs comparably to prior two-view approaches on novel view synthesis, and delivers strong camera pose estimates, offering a practical, pose-free solution for real-world 3D reconstruction from unposed sequences.
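To make the dual-quaternion pose representation concrete, here is a minimal sketch of the standard unit dual quaternion for a rigid transform. It is not taken from the paper: this summary does not specify the exact parameterization or the alignment loss, so the (w, x, y, z) ordering and the helper names `pose_to_dual_quat` and `dual_quat_to_pose` are assumptions made for illustration.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def quat_conj(q):
    """Conjugate of a quaternion (negate the vector part)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def pose_to_dual_quat(q_rot, t):
    """Hypothetical helper: rotation quaternion + translation -> (real, dual) parts.

    Textbook convention: real = q_rot, dual = 0.5 * (0, t) * q_rot.
    """
    q_rot = np.asarray(q_rot, dtype=float)
    q_rot = q_rot / np.linalg.norm(q_rot)      # enforce unit rotation part
    t_quat = np.array([0.0, t[0], t[1], t[2]])  # translation as a pure quaternion
    q_dual = 0.5 * quat_mul(t_quat, q_rot)
    return q_rot, q_dual

def dual_quat_to_pose(q_real, q_dual):
    """Hypothetical helper: invert the mapping, t = 2 * dual * conj(real)."""
    t_quat = 2.0 * quat_mul(q_dual, quat_conj(q_real))
    return q_real, t_quat[1:]

# Round trip for a 90-degree rotation about z plus a translation.
q_real, q_dual = pose_to_dual_quat(
    [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)], [1.0, 2.0, 3.0]
)
_, t_back = dual_quat_to_pose(q_real, q_dual)
```

One reason such a representation can suit a network output is that rotation and translation live in a single 8-dimensional vector, with the real and dual parts orthogonal for any valid pose.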

Abstract

We present VicaSplat, a novel framework for joint 3D Gaussian reconstruction and camera pose estimation from a sequence of unposed video frames, which is a critical yet underexplored task in real-world 3D applications. The core of our method lies in a novel transformer-based network architecture. In particular, our model starts with an image encoder that maps each image to a list of visual tokens. All visual tokens are concatenated with additional learnable camera tokens. The resulting tokens then fully communicate with each other within a tailored transformer decoder. The camera tokens causally aggregate features from the visual tokens of different views, and further modulate them frame-wise to inject view-dependent features. 3D Gaussian splats and camera pose parameters can then be estimated via different prediction heads. Experiments show that VicaSplat surpasses baseline methods for multi-view inputs, and achieves comparable performance to prior two-view approaches. Remarkably, VicaSplat also demonstrates exceptional cross-dataset generalization capability on the ScanNet benchmark, achieving superior performance without any fine-tuning. Project page: https://lizhiqi49.github.io/VicaSplat.
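The token flow described in the abstract can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions, not the authors' architecture: it assumes one camera token per view, reads "causally" as "the camera token of view i attends only to visual tokens of views up to i", and uses a scale-and-shift form for the frame-wise modulation. The function names, the projection matrices `w_scale` and `w_shift`, and the toy sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, D = 3, 4, 8  # views, visual tokens per view, channel width (toy sizes)

visual = rng.normal(size=(V, N, D))  # stand-in for image-encoder output
camera = rng.normal(size=(V, D))     # stand-in for learnable camera tokens

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_camera_attention(camera, visual):
    """Single-head attention from camera tokens to visual tokens.

    Assumption: camera token i may only see visual tokens of views <= i.
    """
    n_views, n_tokens, dim = visual.shape
    keys = visual.reshape(n_views * n_tokens, dim)
    scores = camera @ keys.T / np.sqrt(dim)              # (V, V*N)
    key_view = np.repeat(np.arange(n_views), n_tokens)   # view index of each key
    allowed = key_view[None, :] <= np.arange(n_views)[:, None]
    scores = np.where(allowed, scores, -np.inf)          # mask later views
    weights = softmax(scores, axis=-1)
    return weights @ keys, weights

def framewise_modulate(visual, camera_feat, w_scale, w_shift):
    """Assumed scale-and-shift modulation: one (scale, shift) pair per frame."""
    scale = camera_feat @ w_scale                        # (V, D)
    shift = camera_feat @ w_shift                        # (V, D)
    return visual * (1.0 + scale[:, None, :]) + shift[:, None, :]

camera_feat, weights = causal_camera_attention(camera, visual)
w_scale = 0.1 * rng.normal(size=(D, D))
w_shift = 0.1 * rng.normal(size=(D, D))
modulated = framewise_modulate(visual, camera_feat, w_scale, w_shift)
```

In this sketch every visual token of a frame receives the same scale and shift, which is one simple way a per-view camera feature could inject view-dependent information before the prediction heads.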
