Coca-Splat: Collaborative Optimization for Camera Parameters and 3D Gaussians

Jiamin Wu, Hongyang Li, Xiaoke Jiang, Yuan Yao, Lei Zhang

TL;DR

Coca-Splat tackles sparse-view, pose-free novel view synthesis by jointly optimizing 3D Gaussians and camera parameters in a single network. It introduces CaMDFA, a camera-aware multi-view deformable cross-attention mechanism, and RefRay, a ray-based representation using Plücker coordinates to connect camera parameters to 3D Gaussians through shared 2D reference points, enabling $RQ$-decomposition-based estimation without input poses. The approach achieves state-of-the-art pose-free performance on RealEstate10K and ACID, outperforming prior pose-free methods and rivaling pose-required approaches, while remaining efficient and scalable to multiple views. These results underscore the practical impact of integrating camera geometry with 3D representations for robust, unposed scene reconstruction and NVS in real-world settings.

Abstract

In this work, we introduce Coca-Splat, a novel approach to sparse-view, pose-free scene reconstruction and novel view synthesis (NVS) that jointly optimizes camera parameters and 3D Gaussians. Inspired by Deformable DEtection TRansformer (Deformable DETR), we design separate queries for 3D Gaussians and camera parameters and update them layer by layer through deformable Transformer layers, enabling joint optimization in a single network. This design performs better because rendering views that closely approximate the ground-truth images relies on precise estimation of both 3D Gaussians and camera parameters. In this design, the centers of the 3D Gaussians are projected onto each view using the camera parameters, and the projected points serve as 2D reference points in deformable cross-attention. With camera-aware multi-view deformable cross-attention (CaMDFA), 3D Gaussians and camera parameters are intrinsically connected by sharing the 2D reference points. Additionally, the rays determined by the 2D reference points (RefRay), defined from the camera centers to the reference points, strengthen the relationship between 3D Gaussians and camera parameters: the camera parameters are recovered through RQ-decomposition on an overdetermined system of equations derived from the rays. Extensive evaluation shows that our approach outperforms previous methods, both pose-required and pose-free, on RealEstate10K and ACID within the same pose-free setting.
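
The ray-to-camera step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the exact system of equations is not shown in this summary, so the sketch assumes the common direct-linear-transform formulation, the convention $\mathbf{x}_{cam} = \mathbf{R}\mathbf{x} + \mathbf{t}$, unit ray directions that point forward from the camera, and Plücker moments $\mathbf{m} = \mathbf{c} \times \mathbf{d}$. The names `skew` and `camera_from_rays` are hypothetical.

```python
import numpy as np
from scipy.linalg import rq


def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def camera_from_rays(uv, d, m):
    """Recover K, R, t from rays through known pixels (hypothetical helper).

    uv: (N, 2) pixel reference points
    d:  (N, 3) unit ray directions in world coordinates (assumed forward-facing)
    m:  (N, 3) Pluecker moments, assumed to satisfy m = c x d
    """
    # Camera center: m_i = c x d_i = -[d_i]_x c, stacked into a least-squares system.
    A = np.concatenate([-skew(di) for di in d])          # (3N, 3)
    c, *_ = np.linalg.lstsq(A, m.reshape(-1), rcond=None)

    # M = K R maps world directions to homogeneous pixels (u ~ M d).
    # Each ray contributes two linear equations in the nine entries of M.
    zeros = np.zeros(3)
    rows = []
    for (u, v), di in zip(uv, d):
        rows.append(np.concatenate([di, zeros, -u * di]))
        rows.append(np.concatenate([zeros, di, -v * di]))
    _, _, vt = np.linalg.svd(np.asarray(rows))
    M = vt[-1].reshape(3, 3)                              # null vector, up to scale
    if np.mean((d @ M.T)[:, 2]) < 0:                      # resolve the sign ambiguity
        M = -M

    # RQ decomposition: M = K R with K upper triangular and R orthogonal.
    K, R = rq(M)
    S = np.diag(np.sign(np.diag(K)))                      # force a positive diagonal on K
    K, R = K @ S, S @ R
    K = K / K[2, 2]
    t = -R @ c
    return K, R, t
```

With noise-free rays this recovers the camera exactly up to numerical precision; with predicted rays both linear systems are overdetermined and are solved in the least-squares sense, which is why more reference points help.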

Paper Structure

This paper contains 25 sections, 2 theorems, 24 equations, 15 figures, 9 tables.

Key Result

Theorem 1

Cameras to rays mapping. Given camera center $\mathbf{c} \in \mathbb{R}^{3}$, rotation matrix $\mathbf{R} \in \mathbb{R}^{3 \times 3}$, translation $\mathbf{t} \in \mathbb{R}^3$, camera intrinsics $\mathbf{K} \in\mathbb{R}^{3 \times 3}$, and a point $\mathbf{x} \in \mathbb{R}^3$ in 3D with coordina

Figures (15)

  • Figure 1: Coca-Splat. Given sparse unposed images, our method reconstructs 3D Gaussians and camera rays using a feed-forward network. Subsequently, camera parameters are derived from camera rays, and novel views are rendered from 3D Gaussians.
  • Figure 2: Overall framework. Our model employs an encoder-decoder architecture to simultaneously reconstruct 3D Gaussians and camera parameters. The Vision Transformer (ViT) encoder processes all input images, and the resulting image features serve as keys and values in the CaMDFA block within each decoder layer. We define separate queries for 3D Gaussians and for cameras and feed them into the deformable decoder layers alongside the image features. After $L$ decoder layers, the output queries for 3D Gaussians and cameras pass through a fully connected network (FFN) that regresses the 3D Gaussians and the RefRay (in Plücker coordinates, denoting direction and moment), which points from the camera center to the reference points. The predicted 3D Gaussians are then rendered via Gaussian Splatting to generate novel views, while the rays are used to solve for the camera parameters.
  • Figure 3: Our model projects the centers of the 3D Gaussians onto each input view using the camera parameters to obtain the reference points. The reference points are used to generate the RefRay, the rays from the camera center to the reference points. The 3D Gaussians and camera parameters are thus explicitly linked by the reference points.
  • Figure 4: Qualitative comparison on the RE10K and ACID datasets. The 'L', 'M', and 'S' in brackets denote the large, medium, and small overlap groups, respectively.
  • Figure 5: Geometry visualization with rendered novel views and depth. Our model handles the artifacts visible in the first row better than prior approaches. It also reconstructs elements that prior approaches miss, such as the chairs and tables in the corner, indicated by the blue arrows.
  • ...and 10 more figures
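
The geometry described in the captions of Figures 2 and 3 (projecting Gaussian centers to reference points, then forming rays from the camera center through those points) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: it assumes a pinhole camera with $\mathbf{x}_{cam} = \mathbf{R}\mathbf{x} + \mathbf{t}$, unit-norm directions, and Plücker moments $\mathbf{m} = \mathbf{c} \times \mathbf{d}$. The function names `project_points` and `plucker_rays` are hypothetical.

```python
import numpy as np


def project_points(X, K, R, t):
    """Project world points X (N, 3) to pixel reference points (N, 2).

    Assumes the convention x_cam = R @ x + t and a pinhole intrinsic matrix K.
    """
    X_cam = X @ R.T + t
    uvw = X_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def plucker_rays(uv, K, R, t):
    """Pluecker (direction, moment) of rays from the camera center through pixels uv."""
    c = -R.T @ t                                    # camera center in world coordinates
    uv1 = np.hstack([uv, np.ones((len(uv), 1))])    # homogeneous pixels
    d = uv1 @ np.linalg.inv(K).T @ R                # rows are R^T K^{-1} u
    d /= np.linalg.norm(d, axis=1, keepdims=True)   # unit directions
    m = np.cross(c, d)                              # moment, identical for any point on the ray
    return d, m
```

Because every point on a ray has the same moment, a Gaussian center `X[i]` that projects to `uv[i]` satisfies `np.cross(X[i], d[i]) == m[i]` up to rounding, which is one way to see how the shared reference points tie the Gaussians to the camera.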

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof