Coca-Splat: Collaborative Optimization for Camera Parameters and 3D Gaussians
Jiamin Wu, Hongyang Li, Xiaoke Jiang, Yuan Yao, Lei Zhang
TL;DR
Coca-Splat tackles sparse-view, pose-free novel view synthesis by jointly optimizing 3D Gaussians and camera parameters in a single network. It introduces CaMDFA, a camera-aware multi-view deformable cross-attention mechanism, and RayRef, a ray-based representation using Plücker coordinates that connects camera parameters to 3D Gaussians through shared 2D reference points, so that camera parameters can be estimated via RQ decomposition without input poses. The approach achieves state-of-the-art pose-free performance on RealEstate10K and ACID, outperforming both pose-free and pose-required baselines evaluated in the same pose-free setting, while remaining efficient and scalable to multiple views. The results indicate that coupling camera geometry with the 3D representation makes reconstruction and NVS from unposed images more robust.
Abstract
In this work, we introduce Coca-Splat, a novel approach to sparse-view, pose-free scene reconstruction and novel view synthesis (NVS) that jointly optimizes camera parameters and 3D Gaussians. Inspired by Deformable DETR (DEtection TRansformer), we design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization in a single network. This design performs better because rendering views that closely match the ground-truth images requires precise estimates of both the 3D Gaussians and the camera parameters. The centers of the 3D Gaussians are projected onto each view using the camera parameters, and the projected points serve as 2D reference points in deformable cross-attention. Through camera-aware multi-view deformable cross-attention (CaMDFA), 3D Gaussians and camera parameters are intrinsically connected by sharing these 2D reference points. In addition, rays determined by the 2D reference points (RayRef), defined from the camera centers to the reference points, strengthen this connection by modeling the relationship between 3D Gaussians and camera parameters through RQ decomposition on an overdetermined system of equations derived from the rays. Extensive evaluation on RealEstate10K and ACID shows that our approach outperforms previous methods, both pose-required and pose-free, within the same pose-free setting.
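To make the geometry concrete, the following is a minimal NumPy sketch, not the Coca-Splat implementation. It assumes a pinhole camera with intrinsics K, rotation R, and translation t, projects hypothetical Gaussian centers to 2D reference points, forms rays from the camera center through them, solves the resulting overdetermined linear system for the 3x3 matrix M = K R with a standard direct linear transform, and splits M by RQ decomposition. The function names (project, rq3, camera_from_rays) and all numeric values are illustrative assumptions; how the network itself predicts the rays and solves the system is not specified in the abstract.

```python
import numpy as np

def project(X, K, R, t):
    """Project world points X (N, 3) to pixel coordinates (N, 2)."""
    x_cam = X @ R.T + t              # world -> camera frame
    uvw = x_cam @ K.T                # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]

def rq3(M):
    """RQ decomposition M = K @ R built from NumPy's QR.
    K is upper triangular with positive diagonal, R is orthogonal."""
    P = np.flipud(np.eye(3))         # row-reversal permutation
    Q_t, R_t = np.linalg.qr((P @ M).T)
    K = P @ R_t.T @ P
    R = P @ Q_t.T
    S = np.diag(np.sign(np.diag(K)))  # flip signs so diag(K) > 0
    return K @ S, S @ R

def camera_from_rays(dirs, uv):
    """Estimate K and R from world-space ray directions (N, 3) and the
    2D reference points (N, 2) they pass through, using u ~ M d."""
    rows = []
    for d, (x, y) in zip(dirs, uv):
        zero = np.zeros(3)
        rows.append(np.concatenate([-d, zero, x * d]))
        rows.append(np.concatenate([zero, -d, y * d]))
    A = np.asarray(rows)             # (2N, 9): overdetermined for N > 4
    _, _, Vt = np.linalg.svd(A)
    M = Vt[-1].reshape(3, 3)         # least-squares null vector of A
    if np.linalg.det(M) < 0:         # M is only defined up to sign
        M = -M
    K, R = rq3(M)
    return K / K[2, 2], R

# Hypothetical ground-truth camera, used only to synthesize test data.
K_true = np.array([[500.0, 0.0, 320.0],
                   [0.0, 500.0, 240.0],
                   [0.0, 0.0, 1.0]])
c, s = np.cos(0.2), np.sin(0.2)
R_true = np.array([[c, 0.0, s],
                   [0.0, 1.0, 0.0],
                   [-s, 0.0, c]])
t_true = np.array([0.1, -0.2, 0.5])

rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=(50, 3)) + np.array([0.0, 0.0, 5.0])

ref_points = project(centers, K_true, R_true, t_true)  # 2D reference points
cam_center = -R_true.T @ t_true                        # camera center in world
dirs = centers - cam_center                            # rays through the points
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

K_est, R_est = camera_from_rays(dirs, ref_points)
```

With noise-free rays, K_est and R_est match the synthetic camera up to numerical precision. With noisy rays, the same least-squares solve still returns the best-fitting M, which is why an overdetermined system (many reference points per view) is useful.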
