Revisit Self-supervised Depth Estimation with Local Structure-from-Motion

Shengjie Zhu, Xiaoming Liu

TL;DR

The paper tackles the gap between self-supervised depth estimation and Structure-from-Motion by introducing a local SfM pipeline that operates over a small window (as few as $5$ frames). It replaces learning-through-loss with a Bundle-RANSAC-Adjustment pose optimization and a frustum Radiance Field triangulation with geometric verification to produce a sparse, geometrically verified root-depth while preserving metric scale. The approach yields poses, depth adjustments, and sparse triangulated depths, enabling self-supervised improvements to SoTA supervised models, and achieves state-of-the-art sparse-view pose accuracy and robust self-supervised correspondence estimation on RGB-D data. The work demonstrates practical benefits for temporally consistent depth, AR compositing, and NeRF-style rendering using a non-neural triangulation step, while also providing theoretical extensions via Hough Transform acceleration and accompanying proofs.

Abstract

Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior self-supervised works backpropagate losses defined within immediately neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme that performs local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, though without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show that self-supervision within $5$ frames already benefits SoTA supervised depth and correspondence models. The project page is available at https://shngjz.github.io/SSfM.github.io/.
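To make the role of the per-depthmap adjustment concrete, here is a minimal NumPy sketch of how a candidate pose and adjustment could be scored against a correspondence map. It is an illustration under stated assumptions, not the paper's bundle-RANSAC-adjustment: the adjustment is assumed to be a multiplicative scalar `r_i`, both frames are assumed to share pinhole intrinsics `K`, the inlier test is assumed to be a 2D reprojection threshold `thresh_px`, and every function name is hypothetical.

```python
import numpy as np

def backproject(pixels, depths, K):
    """Lift (M, 2) pixels with (M,) depths to (M, 3) points in the camera frame."""
    homog = np.hstack([pixels, np.ones((pixels.shape[0], 1))])
    rays = homog @ np.linalg.inv(K).T      # each row is K^{-1} [u, v, 1]^T
    return rays * depths[:, None]

def project(points, K):
    """Project (M, 3) camera-frame points to (M, 2) pixels."""
    proj = points @ K.T
    return proj[:, :2] / proj[:, 2:3]

def inlier_score(pix_i, depth_i, pix_j, K, R_ij, t_ij, r_i=1.0, thresh_px=2.0):
    """Count correspondences explained by pose (R_ij, t_ij) and adjustment r_i.

    Assumption: r_i rescales the whole monocular depthmap of frame i, and a
    correspondence is an inlier if its reprojection error is below thresh_px.
    """
    pts_i = backproject(pix_i, r_i * depth_i, K)   # adjusted 3D points, frame i
    pts_j = pts_i @ R_ij.T + t_ij                  # move them into frame j
    err = np.linalg.norm(project(pts_j, K) - pix_j, axis=1)
    return int(np.sum(err < thresh_px))

def best_depth_adjustment(candidates, pix_i, depth_i, pix_j, K, R_ij, t_ij):
    """Brute-force stand-in for the optimizer: keep the highest-scoring candidate."""
    return max(candidates,
               key=lambda r: inlier_score(pix_i, depth_i, pix_j, K, R_ij, t_ij, r_i=r))

# Toy data: two correspondences that are exact when r_i = 1.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R_ij, t_ij = np.eye(3), np.array([1.0, 0.0, 0.0])
pix_i = np.array([[50.0, 50.0], [60.0, 50.0]])
depth_i = np.array([2.0, 4.0])
pix_j = np.array([[100.0, 50.0], [85.0, 50.0]])
best_r = best_depth_adjustment([0.5, 1.0, 2.0], pix_i, depth_i, pix_j, K, R_ij, t_ij)
```

On this toy data only the candidate `r_i = 1.0` explains both correspondences. A wrongly scaled depthmap sends the reprojections far from their matches, which is why a scalar adjustment can be recovered from correspondences alone.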

Paper Structure (26 sections, 2 theorems, 46 equations, 15 figures, 7 tables)


Key Result

Corollary

A pixel is an inlier iff:

Figures (15)

  • Figure 1: Revisit Self-supervision with Local SfM. The work proposes replacing learning-through-loss with a local SfM pipeline for self-supervised depth estimation. We summarize our differences. On self-supervision: (1) Instead of using naive two-view camera poses, we propose a Bundle-RANSAC-Adjustment pose optimization algorithm with multi-view constraints. (2) Instead of backpropagating through a loss, we produce a sparse point cloud with explicit triangulation and geometric verification. The point cloud serves as either output or pseudo-groundtruth for self-supervision. On SfM: (1) Our local SfM is adapted to use estimated monocular depthmaps and automatically resolve their scale inconsistency between pairs of images. (2) We maintain accuracy under significant sparse-view variations, e.g., the red trajectories. We generalize SfM to as few as $5$ frames, similar to the number of images used to define a self-supervision loss.
  • Figure 1: Two-view Hough Transform on 3D Scoring Function. Pixels $\mathbf{p}_i$ and $\mathbf{p}_j$ are in correspondence. Similar to Fig. \ref{fig:hough}, ablating pose scales maps pixel $\mathbf{p}_i$ to a set of 3D rays originating from the camera origin, denoted as $\{\hat{\mathbf{l}}_i\}$. To be an inlier, a backprojected 3D point $\hat{\mathbf{p}}_{\pi}$ has to reside within a sphere centered at $\hat{\mathbf{p}}_j$ with radius $\lambda^{\text{3D}}$, i.e., between the 3D segment endpoints $\hat{\mathbf{p}}_{\pi}^{\text{st}}$ and $\hat{\mathbf{p}}_{\pi}^{\text{ed}}$. Different from Fig. \ref{fig:hough}, with fixed normalized poses, there exist four variables to optimize, including the additional frame $j$ depth adjustment $r_j$.
  • Figure 2: Local Structure-from-Motion. With $N$ neighboring frames, we extract monocular depthmaps and pairwise dense correspondence maps with methods such as ZoeDepth [bhat2023zoedepth] and PDC-Net [truong2023pdc]. Next, skipping the root frame, we optimize the remaining $N-1$ camera poses and depth adjustments. The depth adjustments render the input depthmaps temporally consistent. Fixing poses and adjustments, we use the Radiance Field (RF) for triangulation and output a geometrically verified sparse root depthmap. Our local SfM applies self-supervision with only $5$ RGB frames. Yet, our sparse output already outperforms the input supervised depth with SoTA performance.
  • Figure 2: Scores w.r.t Optimization Epochs. The inlier scores always rise throughout the optimization. With $5$ frames, our algorithm terminates on average at $23.5$ epochs. [Key: Inlier Score / Monodepth Estimator / Correspondence Estimator]
  • Figure 3: Algorithm Overview. After extracting monodepths and correspondence maps from inputs: (a) We apply Bundle-RANSAC-Adjustment to optimize $N-1$ camera poses $\mathcal{P}$ and $N-1$ depth adjustments $\mathcal{R}$. (b) We fix poses and depth adjustments and optimize a frustum Radiance Field (RF) for triangulation. (c) We apply geometric verification to extract multi-view consistent 3D points via rendering with RF. We further detail step (a) in Figs. \ref{fig:pose_ba}, \ref{fig:hough}, and \ref{fig:ba_real}, and steps (b) and (c) in Fig. \ref{fig:ablation}.
  • ...and 10 more figures
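
The 3D inlier condition described in the two-view Hough Transform caption (a backprojected point must fall inside a sphere of radius $\lambda^{\text{3D}}$ around its correspondence) can be sketched as below. This is an illustration only: it assumes the depth adjustments $r_i$ and $r_j$ are multiplicative scalars, that both frames share pinhole intrinsics `K`, and that `(R_ij, t_ij)` maps frame $i$ coordinates into frame $j$. The function and argument names are hypothetical and do not come from the paper.

```python
import numpy as np

def is_3d_inlier(p_i, d_i, r_i, p_j, d_j, r_j, K, R_ij, t_ij, lam_3d):
    """Sphere test for one correspondence (p_i in frame i, p_j in frame j).

    d_i, d_j are monocular depths at the two pixels; r_i, r_j are the assumed
    scalar depth adjustments; lam_3d is the sphere radius (lambda^{3D}).
    """
    K_inv = np.linalg.inv(K)
    # Backproject p_i with its adjusted depth, then move it into frame j.
    x_i = r_i * d_i * (K_inv @ np.array([p_i[0], p_i[1], 1.0]))
    x_pi = R_ij @ x_i + t_ij
    # Backproject p_j with its own adjusted depth: the sphere center.
    x_j = r_j * d_j * (K_inv @ np.array([p_j[0], p_j[1], 1.0]))
    return bool(np.linalg.norm(x_pi - x_j) < lam_3d)

# Toy check: the correspondence is exact when both adjustments equal 1.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R_ij, t_ij = np.eye(3), np.array([1.0, 0.0, 0.0])
consistent = is_3d_inlier((50.0, 50.0), 2.0, 1.0, (100.0, 50.0), 2.0, 1.0,
                          K, R_ij, t_ij, lam_3d=0.1)
inconsistent = is_3d_inlier((50.0, 50.0), 2.0, 1.0, (100.0, 50.0), 2.0, 1.5,
                            K, R_ij, t_ij, lam_3d=0.1)
```

Unlike a 2D reprojection test, this check depends on the depth of frame $j$ as well, which is consistent with the caption's remark that the frame $j$ adjustment $r_j$ becomes an additional variable to optimize.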

Theorems & Definitions (2)

  • Corollary
  • Corollary