A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose

Kaiwen Jiang; Yang Fu; Mukund Varma T; Yash Belhe; Xiaolong Wang; Hao Su; Ravi Ramamoorthi

TL;DR

This work addresses sparse view synthesis without known camera poses by proposing a construct-and-optimize pipeline that builds a scene with 3D Gaussians using monocular depth estimates and progressively registers and adjusts camera poses and depths. A differentiable surface rendering formulation for Gaussian splatting enables long-range supervision through 2D correspondences, guiding both pose registration and depth alignment to produce a coherent scene. After obtaining a coarse solution, the method applies low-pass filtering and refines with standard 3DGS optimization, achieving state-of-the-art results on Tanks & Temples and Static Hikes with as few as $3$ views and improving as more views are provided. The approach outperforms both pose-free and pose-based baselines, demonstrating strong practical impact for sparse view synthesis and potential extensions to unordered image collections and more robust monocular depth alignment.

Abstract

Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset. Project page: https://raymondjiangkw.github.io/cogs.github.io/
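The back-projection step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with known intrinsics `K`, a z-depth map and a camera-to-world pose; the function names `back_project` and `project` are hypothetical and not taken from the authors' code, and the sketch lifts pixels to 3D points only, without creating Gaussians.

```python
import numpy as np

def back_project(depth, K, cam_to_world):
    """Lift every pixel of a depth map to a 3D world-space point.

    depth:        (H, W) z-depth per pixel (e.g. a monocular estimate).
    K:            (3, 3) pinhole intrinsics (assumed known in this sketch).
    cam_to_world: (4, 4) camera-to-world pose.
    Returns an (H*W, 3) array of world-space points in row-major pixel order.
    """
    H, W = depth.shape
    # Pixel centres in homogeneous screen coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Camera-space rays with unit z, scaled by the per-pixel depth.
    cam_pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_pts @ R.T + t

def project(points, K, cam_to_world):
    """Project world-space points to pixel coordinates and z-depth."""
    world_to_cam = np.linalg.inv(cam_to_world)
    cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]
```

Projecting the lifted points back with the same pose recovers the original pixel centres and depths. Re-projecting them with a different pose is what makes depth and pose errors visible as 2D mismatches.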

Paper Structure (18 sections, 8 equations, 9 figures, 4 tables)

This paper contains 18 sections, 8 equations, 9 figures, 4 tables.

Introduction
Related Work
Sparse view synthesis.
Optimizing camera poses in NeRFs.
Surface rendering in Gaussian splatting.
Method
Algorithm Overview
Optimization Framework
Differentiable Surface Rendering
Refinement and Implementation Details
Results
Evaluation Details.
Datasets.
Comparison
Quantitative Evaluation.
...and 3 more sections

Figures (9)

Figure 1: Example of ambiguity given partial views. Given the scene in (a), different scene layouts, such as those shown in (b) and (c), are possible if only the first view or the second view is observed, and either (b) or (c) could be the estimated depth. This ambiguity results in unavoidable error in monocular depth estimation, which necessitates the alignment between camera poses and estimated depths.
Figure 2: Overview of our method for sparse view synthesis. We first back-project the first view, then register, adjust and back-project the remaining views in sequence to obtain a coarse solution. This coarse solution is then refined by standard optimization to reproduce fine details.
Figure 3: We assume the first $k$ views have already been registered, and illustrate the registration, adjustment and back-projection of the $(k+1)$th view. (a) We first initialize the camera pose of the $(k+1)$th view, denoted as $P_{k+1}$, to the $k$th view's camera pose. 2D correspondences are detected between the ground-truth image $I_{k+1}$ and the rendered result $I_\text{render}(P_{k+1})$ at $P_{k+1}$. Correspondence points on $I_\text{render}(P_{k+1})$ are denoted as $\kappa'$, while those on $I_{k+1}$ are denoted as $\kappa$. Green points denote correct correspondences, while red points denote wrong correspondences. Perspective-n-Point (PnP) could be used to solve for the camera pose, but it yields an erroneous solution. (b) We then apply our optimization pipeline (see Optimization Framework) to estimate the camera pose for registration. At this stage, the monocular depth $D_{k+1}$ of the $(k+1)$th view still deviates significantly from the rendered depth $D_\text{render}(P_{k+1})$ at $P_{k+1}$. (c) Afterwards, we apply the same optimization pipeline to adjust all previously registered camera poses and monocular depths along with $P_{k+1}$ and $D_{k+1}$. It can be seen that $I_\text{render}(P_{k+1})$ and $D_{k+1}$ are now much closer to $I_{k+1}$ and $D_\text{render}(P_{k+1})$, respectively. Finally, we back-project pixels in the $(k+1)$th view into world space as 3D Gaussians based on $D_{k+1}$. Image credit: Tanks and Temples.
Figure 4: Illustration of surface rendering in Gaussian splatting. Assume the ray is shot from screen-space coordinates $\mathbf{s}$ and $\Psi(\mathbf{s})$ denotes the rendered surface point. $\pi(\cdot)$ denotes projecting 3D points into screen space. (a) Depth rendering of previous methods. The depth $d$ of a Gaussian kernel is defined as the $z$-axis coordinate for the transformed center $\boldsymbol{\mu}$ in the camera space. (b) Extending (a) to render the exact 3D surface point. The surface point of the Gaussian kernel is defined as the center $\boldsymbol{\mu}$. It could result in a mismatch between $\mathbf{s}$ and $\pi(\Psi(\mathbf{s}))$. (c) Approximate surface rendering of our method. The surface point $\widehat{\boldsymbol{\mu}}(\mathbf{s})$ of the Gaussian kernel is defined as the intersection point between the ray and an ellipsoid shell. Therefore, our method guarantees a match between $\mathbf{s}$ and $\pi(\Psi(\mathbf{s}))$. (d) Surface rendering of our method when considering all the rays passing through the center of a spherical Gaussian kernel. The expected surface points form a shell.
Figure 5: (a) Illustration of the invariance of relative position $\delta$ between the surface point $\widehat{\boldsymbol{\mu}}(\mathbf{s})$, and the center of the Gaussian kernel $\boldsymbol{\mu}$. $\delta$ is expected to be translation- and rotation-invariant. (b) Illustration of re-parameterizing surface point $\widehat{\boldsymbol{\mu}}(\mathbf{s})$ from an intersected point into a point defined on an ellipsoid shell.
...and 4 more figures
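
The contrast between panels (b) and (c) of Figure 4 can be checked numerically with a small sketch. It works in camera space with the camera at the origin, and treats the shell as the level set $(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) = r^2$. The shell level `r`, the choice of the nearest intersection, and all function names are assumptions made for illustration; the paper's exact definition of the expected surface may differ.

```python
import numpy as np

def pixel_ray(s, K):
    """Unit direction of the camera-space ray through screen point s."""
    d = np.linalg.inv(K) @ np.array([s[0], s[1], 1.0])
    return d / np.linalg.norm(d)

def project(x, K):
    """Pinhole projection pi(x) of a camera-space point to screen space."""
    uvw = K @ x
    return uvw[:2] / uvw[2]

def shell_intersection(d, mu, cov, r=1.0):
    """Nearest hit of the ray x = t * d (t > 0) with the ellipsoid shell
    (x - mu)^T cov^{-1} (x - mu) = r^2, or None if the ray misses it."""
    A = np.linalg.inv(cov)
    # Substituting x = t * d gives a quadratic a t^2 + b t + c = 0.
    a = d @ A @ d
    b = -2.0 * (d @ A @ mu)
    c = mu @ A @ mu - r * r
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    t = (-b - np.sqrt(disc)) / (2.0 * a)
    return t * d if t > 0.0 else None

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
mu = np.array([0.1, -0.05, 4.0])      # kernel centre in camera space
cov = np.diag([0.09, 0.04, 0.16])     # kernel covariance
s = np.array([53.0, 49.0])            # screen-space query point

surface = shell_intersection(pixel_ray(s, K), mu, cov)
centre_mismatch = np.linalg.norm(project(mu, K) - s)     # panel (b)
shell_mismatch = np.linalg.norm(project(surface, K) - s)  # panel (c)
```

Because the intersection point lies on the ray through $\mathbf{s}$ by construction, `shell_mismatch` is zero up to floating-point error, whereas using the kernel centre leaves a nonzero screen-space offset.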
