Improving Robustness for Joint Optimization of Camera Poses and Decomposed Low-Rank Tensorial Radiance Fields

Bo-Yu Cheng, Wei-Chen Chiu, Yu-Lun Liu

TL;DR

This work addresses robust joint optimization of camera poses and a 3D scene represented by decomposed low-rank tensors using only 2D supervision, noting that naive voxel-based pose optimization can converge to sub-optimal minima due to high-frequency content. It introduces a spectrum-control framework built on separable component-wise Gaussian convolution over decomposed tensors, enabling a coarse-to-fine training regime, plus robustness techniques including smoothed 2D supervision, randomly scaled kernels, and edge-guided loss. A key contribution is an efficient separable convolution approach that distributes 3D Gaussian filtering across tensor components, achieving significant computational savings while preserving expressivity, and allowing a single voxel grid to be trained with accelerated convergence. Empirically, the method delivers state-of-the-art novel view synthesis and robust pose recovery on NeRF-Synthetic and LLFF datasets, converging an order of magnitude faster than prior methods that require hundreds of thousands of iterations. Overall, the paper advances robust joint optimization for voxel-based radiance fields with decomposed representations, making unknown-pose 3D reconstruction more practical and scalable.

Abstract

In this paper, we propose an algorithm that jointly refines camera poses and scene geometry represented by a decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study on a 1D signal and relate our findings to 3D scenarios, where naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Based on an analysis of the frequency spectrum, we then propose applying convolutional Gaussian filters to the 2D and 3D radiance fields, giving a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property of the decomposed low-rank tensor, our method achieves an effect equivalent to brute-force 3D convolution while incurring little computational overhead. To further improve the robustness and stability of joint optimization, we also propose smoothed 2D supervision, randomly scaled kernel parameters, and an edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our framework achieves superior performance in novel view synthesis as well as rapid convergence during optimization.
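
The claim that filtering the decomposed components matches a 3D convolution of the full volume can be checked numerically. The sketch below is only an illustration under assumptions not taken from the paper: it uses a CP-style sum of rank-1 terms (outer products of three vectors), zero-padded "same" convolution, and hypothetical names such as `gaussian_kernel_1d` and `blur_dense`. Because a 3D Gaussian is separable, blurring each 1D factor and then reassembling the volume should agree with blurring the dense volume along every axis.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel truncated at roughly 3 sigma."""
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur_1d(v, kernel):
    """Zero-padded 1D convolution that keeps the input length."""
    return np.convolve(v, kernel, mode="same")

def blur_dense(vol, kernel):
    """Reference: filter the dense volume along each of its three axes."""
    out = vol
    for axis in range(3):
        out = np.apply_along_axis(blur_1d, axis, out, kernel)
    return out

rng = np.random.default_rng(0)
rank, n = 4, 16                      # hypothetical rank and grid resolution
A, B, C = (rng.normal(size=(rank, n)) for _ in range(3))
kernel = gaussian_kernel_1d(sigma=1.5)

# Dense route: assemble the full n x n x n volume, then blur it.
volume = np.einsum("ri,rj,rk->ijk", A, B, C)
dense_result = blur_dense(volume, kernel)

# Component-wise route: blur each 1D factor, then assemble.
A_b, B_b, C_b = (np.stack([blur_1d(row, kernel) for row in F]) for F in (A, B, C))
factored_result = np.einsum("ri,rj,rk->ijk", A_b, B_b, C_b)

max_err = float(np.abs(dense_result - factored_result).max())
print("max abs difference:", max_err)
```

In this toy setting the component-wise route filters `3 * rank` vectors of length `n` instead of an `n × n × n` volume, which is where the computational saving described in the abstract would come from.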

Paper Structure

This paper contains 24 sections, 4 theorems, 16 equations, 7 figures, 8 tables.

Key Result

Theorem 1

If we assume rapid convergence of the signal $g$ (that is, $g$ reaches its local optimum $g^*$ with respect to the current $q_1, q_2$ whenever $q_1, q_2$ are updated), the problem in Eq. \ref{eq:1d_loss} is equivalent to pure alignment between the two ground-truth signals, where $u = (p_1-p_2) - (q_1-q_2)$ is the shift between the two ground-truth signals and has an initial value of $p_1 - p_2$.
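
To make the alignment view concrete, the following toy sketch scans the alignment loss between a 1D signal and a copy of itself shifted by $u$, with and without Gaussian low-pass filtering. Everything specific here is an assumption for illustration and not taken from the paper: the periodic test signal, the squared-error loss, the value `sigma=0.3`, and the helper names. It only shows the qualitative effect that strong high-frequency content adds spurious local minima over $u$, while filtering leaves a single basin.

```python
import numpy as np

N = 512
x = 2 * np.pi * np.arange(N) / N
# Hypothetical ground-truth signal: one low and one high frequency component.
signal = np.sin(x) + 0.5 * np.sin(15 * x)

def gaussian_lowpass(f, sigma):
    """Periodic Gaussian blur, applied as a multiplication in Fourier space."""
    spectrum = np.fft.rfft(f)
    k = np.arange(spectrum.size)      # integer frequencies on a 2*pi domain
    return np.fft.irfft(spectrum * np.exp(-0.5 * (sigma * k) ** 2), n=f.size)

def alignment_loss(f):
    """Mean squared error between f and f shifted by every grid offset u."""
    return np.array([np.mean((f - np.roll(f, s)) ** 2) for s in range(f.size)])

def count_local_minima(loss):
    """Strict local minima of a periodic 1D loss curve."""
    left, right = np.roll(loss, 1), np.roll(loss, -1)
    return int(np.sum((loss < left) & (loss < right)))

n_raw = count_local_minima(alignment_loss(signal))
n_smooth = count_local_minima(alignment_loss(gaussian_lowpass(signal, sigma=0.3)))
print("local minima without filtering:", n_raw)
print("local minima with filtering:   ", n_smooth)
```

A gradient-based optimizer started at a large shift can stop in any of the extra minima of the unfiltered curve, whereas on the filtered curve every descent direction leads to $u = 0$.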

Figures (7)

  • Figure 1: Robust joint pose refinement on a decomposed tensor. Our method enables joint optimization of camera poses and a decomposed voxel representation by applying efficient separable component-wise convolution of Gaussian filters to the 3D tensor volume and the 2D supervision images.
  • Figure 2: Comparison of naive joint pose optimization and our proposed method on voxel-based NeRFs. (a) Naively applying joint optimization to voxel-based NeRFs fails dramatically, because premature high-frequency signals in the voxel volume cause the camera poses to get stuck in local minima. (b) We propose a computationally efficient way to directly control the spectrum of the radiance field by performing separable component-wise convolution of Gaussian filters on the decomposed tensor. The proposed training scheme allows the joint optimization to converge successfully to a better solution.
  • Figure 3: Spectrum analysis and effect of Gaussian filtering on 1D signal alignment. (a) 1D signal alignment comparison: without Gaussian filtering, noisy signals can get trapped in local optima. (b) (Top) Visualization of $H(u, k)$ in Eq. \ref{eq:1d_transfer}, which alternates in sign as $k$ departs from $0$ and therefore misdirects gradient-based optimization if the signal has too much high-frequency energy. (b) (Bottom) Visualization of $\tilde{H}(u, k)$ in Eq. \ref{eq:1D_filter}, which is $H(u, k)$ modulated by the Gaussian filter $\mathcal{N}$. (c) 1D alignment relates to 3D joint optimization in Eq. \ref{eq:joint3D}, where effective pose refinement stems from 1D alignment in specific cross-sections; the red lines in the 3D scene correspond to horizontal shifts (blue arrows) and rotations (green arrows).
  • Figure 4: Visualization of the 2D randomly sampled kernel and edge-guided loss. (a) Input supervision without a kernel. Joint optimization using unblurred images easily overfits to high-frequency noise. (b) Input supervision blurred by an overly aggressive kernel. The blurring largely destroys the edge information, resulting in weak and noisy gradients that let the poses drift easily. (c) The same input supervision blurred by four randomly scaled kernels. We found empirically that mixing different filtering strengths results in more robust joint optimization. (d) We select the edge area of a blurred image using a Sobel filter, with the threshold set to 1.25× the average value of the filtered edge-strength map.
  • Figure 5: Qualitative comparisons of 2D image patch alignment. 2D TensoRF + 2D Gaussian successfully registers accurate warping parameters, verifying the analysis of Gaussian filtering on joint optimization.
  • ...and 2 more figures
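
The edge-area selection described in the Figure 4(d) caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the edge-replicating padding, and the use of gradient magnitude as the "edge-strength map" are assumptions. Only the Sobel filter and the 1.25× average threshold come from the caption.

```python
import numpy as np

def sobel_edge_strength(img):
    """Gradient magnitude of a 2D grayscale image from 3x3 Sobel filters."""
    p = np.pad(img, 1, mode="edge")   # assumed boundary handling
    # Horizontal gradient: right column minus left column, weights 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    # Vertical gradient: bottom row minus top row, weights 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def edge_mask(img, scale=1.25):
    """Boolean mask of pixels whose edge strength exceeds scale * mean."""
    strength = sobel_edge_strength(img)
    return strength > scale * strength.mean()

# Toy image with a single vertical step edge between columns 7 and 8.
image = np.zeros((16, 16))
image[:, 8:] = 1.0
mask = edge_mask(image)
print("selected pixels:", int(mask.sum()))
```

In a training loop, such a mask would typically be computed on the blurred supervision image and used to restrict or weight the photometric loss to pixels near edges; how exactly the paper combines it with the loss is not stated in the caption.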

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4