PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization

Siyan Dong, Zijun Wang, Lulu Cai, Yi Ma, Yanchao Yang

TL;DR

This work employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry.

Abstract

Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail when large motions leave them poorly initialized, while learning-based approaches are robust but not accurate enough for dense reconstruction. We address this challenge by combining learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks while maintaining comparable accuracy on stable motion sequences. The system operates in real time, showing that combining simple and principled techniques can achieve both robustness to unstable motions and accuracy in dense reconstruction. Code is available at https://github.com/siyandong/PROFusion.

Paper Structure

This paper contains 13 sections, 2 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Dense scene reconstruction under challenging camera motions. Two representative sequences demonstrate failure cases where state-of-the-art methods like ROSEFusion [zhang2021rosefusion] (left) produce corrupted reconstructions due to unstable camera motions involving large translations and fast in-place rotations. Our approach (right), which combines camera pose regression and optimization, reconstructs accurate scene layouts under the same conditions, demonstrating superior robustness to camera motion instability.
  • Figure 2: System overview. We use the first frame (set to identity pose) to initialize the scene represented by TSDF grids. The following frames are incrementally fused to the scene through a two-step process: first, a coarse registration with the previous frame via camera pose regression, and second, a fine-grained alignment to the TSDF via a randomized optimization algorithm. The aligned frames then update the TSDF values by modifying known grids and filling in new ones. Through this process, both camera motion and scene geometry are progressively reconstructed.
  • Figure 3: Network architecture. The network takes a pair of consecutive RGB-D frames as input and outputs the relative camera pose that aligns the second frame to the first. The color images and the depth images (converted to metric point clouds) are divided into tokens and fed into a Transformer backbone with a pose regression head that infers the relative transformation matrix.
  • Figure 4: Illustration of the camera pose search at iteration $i$ of our randomized optimization. We apply a set of delta poses $\{ \Delta P_k^{(i)} \}_{k=1}^K$ to the current pose $P_t^{(i-1)}$ and evaluate how well each resulting pose fits TSDF$_{t-1}$. Delta poses with better alignment are collected in an advantage set, which is used to update the current pose and the search size for the next iteration.
  • Figure 5: Visual comparison between ROSEFusion (the most robust competitor) and our system. We present dense reconstruction results on the four most challenging sequences of FastCaMo-Real. For each sequence, we drop 50%-80% of frames to mimic unstable motion. Our system performs only single-frame camera tracking, without bundle adjustment or loop closure, yet the reconstructed layout demonstrates both robustness (no wrong registration) and accuracy (minimal drift).
  • ...and 2 more figures
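
The search loop described in the Figure 4 caption can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation. The fitness here is a mean squared point-to-point distance with known correspondences, standing in for the TSDF fitness. The identity pose stands in for the regressed initial pose. The rules for updating the pose (take the best member of the advantage set) and the search size (RMS of the advantage deltas, halved when the set is empty) are hypothetical choices. The names `make_pose`, `alignment_error`, and `randomized_pose_search` are also hypothetical.

```python
import numpy as np

def make_pose(xi):
    """Build a 4x4 rigid transform from (rotation vector, translation).
    Note: the translation is used directly; this is not the SE(3) exponential."""
    w, t = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    K = np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])
    if theta < 1e-12:
        R = np.eye(3) + K
    else:  # Rodrigues' rotation formula
        R = (np.eye(3) + np.sin(theta) / theta * K
             + (1.0 - np.cos(theta)) / theta**2 * (K @ K))
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def alignment_error(T, src, dst):
    """Toy fitness (assumption): mean squared distance after applying T to src."""
    moved = src @ T[:3, :3].T + T[:3, 3]
    return float(np.mean(np.sum((moved - dst) ** 2, axis=1)))

def randomized_pose_search(T_init, src, dst, iters=40, K=64, scale=0.05, seed=0):
    rng = np.random.default_rng(seed)
    T, err = T_init.copy(), alignment_error(T_init, src, dst)
    for _ in range(iters):
        # Sample K delta poses and apply each one to the current pose.
        xis = rng.normal(0.0, scale, size=(K, 6))
        cands = [make_pose(xi) @ T for xi in xis]
        errs = np.array([alignment_error(C, src, dst) for C in cands])
        # Advantage set: deltas that align better than the current pose.
        advantage = np.flatnonzero(errs < err)
        if advantage.size == 0:
            scale *= 0.5  # nothing improved: search closer to the current pose
            continue
        best = advantage[np.argmin(errs[advantage])]
        T, err = cands[best], float(errs[best])
        # Hypothetical search-size update: RMS magnitude of the advantage deltas.
        scale = float(np.sqrt(np.mean(xis[advantage] ** 2)))
    return T, err

# Synthetic data: T_true maps src exactly onto dst.
data_rng = np.random.default_rng(1)
dst = data_rng.uniform(-1.0, 1.0, size=(200, 3))
T_true = make_pose(np.array([0.0, 0.0, 0.2, 0.1, -0.05, 0.08]))
src = (dst - T_true[:3, 3]) @ T_true[:3, :3]

T_init = np.eye(4)  # stand-in for the coarse pose from the regression network
init_err = alignment_error(T_init, src, dst)
T_est, final_err = randomized_pose_search(T_init, src, dst)
print(f"error before: {init_err:.6f}, after: {final_err:.6f}")
```

Because a candidate is accepted only when it lowers the error, the error in this sketch never increases across iterations. The quality of the starting pose therefore determines how much searching remains, which is the role the captions assign to the regression step.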