DualRefine: Self-Supervised Depth and Pose Estimation Through Iterative Epipolar Sampling and Refinement Toward Equilibrium

Antyanta Bangunharcana; Ahmed Magd; Kyung-Soo Kim

TL;DR

Self-supervised monocular depth learning often suffers from pose-induced epipolar errors and dynamic scenes. DualRefine addresses this by jointly refining depth $D$ and pose $T$ in a deep equilibrium (DEQ) framework, using iterative, epipolar-guided sampling of local matching costs and direct feature alignment to drive both quantities toward a fixed point. Depth updates inform pose refinements, and the evolving pose estimates in turn reshape the epipolar geometry, improving matching costs and geometric consistency. On KITTI, DualRefine achieves competitive depth accuracy and markedly better odometry than prior self-supervised baselines, while remaining memory-efficient by using local, fixed-point optimization rather than full 3D cost volumes.
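The idea of alternating depth and pose updates until they stop changing can be illustrated with a toy fixed-point iteration. This is only a sketch under strong assumptions: `depth_update` and `pose_update` are hypothetical scalar contraction maps standing in for the learned Conv-GRU depth update and the feature-metric pose alignment, and nothing here reflects the authors' implementation or their DEQ solver.

```python
def depth_update(depth, pose):
    # Hypothetical stand-in for the learned depth update (not the paper's network).
    return 0.5 * depth + 0.25 * pose + 1.0


def pose_update(depth, pose):
    # Hypothetical stand-in for the pose alignment; it consumes the refined depth.
    return 0.25 * depth + 0.5 * pose


def refine_to_fixed_point(depth, pose, tol=1e-9, max_iters=200):
    """Alternate depth and pose updates until both change by less than tol."""
    for step in range(1, max_iters + 1):
        new_depth = depth_update(depth, pose)
        # The pose update sees the freshly refined depth, mirroring the feedback loop.
        new_pose = pose_update(new_depth, pose)
        residual = max(abs(new_depth - depth), abs(new_pose - pose))
        depth, pose = new_depth, new_pose
        if residual < tol:
            break
    return depth, pose, step


depth_star, pose_star, steps = refine_to_fixed_point(0.0, 0.0)
print(depth_star, pose_star, steps)
```

For these toy maps the equilibrium can be solved by hand (depth 8/3, pose 4/3), which gives a simple check that the loop converges to a point where a further update changes neither quantity.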

Abstract

Self-supervised multi-frame depth estimation achieves high accuracy by computing matching costs of pixel correspondences between adjacent frames, injecting geometric information into the network. These pixel-correspondence candidates are computed from the relative pose estimates between the frames. Accurate pose predictions are essential for precise matching cost computation because they determine the epipolar geometry. Conversely, improved depth estimates can be used to align the pose estimates. Inspired by traditional structure-from-motion (SfM) principles, we propose the DualRefine model, which tightly couples depth and pose estimation through a feedback loop. Our novel update pipeline uses a deep equilibrium model framework to iteratively refine depth estimates and a hidden state of feature maps by computing local matching costs based on epipolar geometry. Importantly, we use the refined depth estimates and feature maps to compute pose updates at each step. These pose updates gradually alter the epipolar geometry during the refinement process. Experimental results on the KITTI dataset demonstrate competitive depth prediction and odometry prediction performance surpassing published self-supervised baselines.
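To make the dependence of the epipolar geometry on the pose concrete, the following NumPy sketch computes the epipolar line in the source image for one target pixel and measures how far the true correspondence falls from that line when the pose is slightly wrong. Everything numeric here is an assumption for illustration: the intrinsics `K`, the poses, the pixel, and the depth are made up, and the convention $X_s = R X_t + t$ (target to source camera coordinates) is chosen for the example rather than taken from the paper.

```python
import numpy as np


def skew(v):
    """Cross-product matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def epipolar_line(K, R, t, pixel):
    """Line (a, b, c) in the source image for a target pixel, via F = K^-T [t]_x R K^-1."""
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ skew(t) @ R @ K_inv
    return F @ np.array([pixel[0], pixel[1], 1.0])


def project(K, R, t, pixel, depth):
    """Back-project a target pixel at the given depth and project it into the source image."""
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    uvw = K @ (R @ (depth * ray) + t)
    return uvw[:2] / uvw[2]


def point_line_distance(line, point):
    """Distance in pixels from a 2D point to the line a*u + b*v + c = 0."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) / np.hypot(a, b)


# Illustrative values only (assumed, not from the paper).
K = np.array([[720.0, 0.0, 620.0], [0.0, 720.0, 187.0], [0.0, 0.0, 1.0]])
R_true, t_true = np.eye(3), np.array([0.1, 0.0, -1.0])

# A slightly wrong pose: a 0.02 rad rotation about the y axis, same translation.
c, s = np.cos(0.02), np.sin(0.02)
R_bad = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

pixel, depth = (700.0, 250.0), 10.0
match = project(K, R_true, t_true, pixel, depth)  # true correspondence in the source image

dist_true = point_line_distance(epipolar_line(K, R_true, t_true, pixel), match)
dist_bad = point_line_distance(epipolar_line(K, R_bad, t_true, pixel), match)
print(dist_true, dist_bad)
```

With the correct pose the true match lies on the epipolar line up to floating-point error, whereas the perturbed pose moves the line several pixels away from it, so matching costs sampled along that line would miss the correct correspondence.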

Paper Structure (29 sections, 9 equations, 8 figures, 9 tables)

Introduction
Related Work
Depth from a single image
Depth from multiple frames
Iterative refinements
Pose estimation
Method
Self-supervised depth and pose
Monocular model
Deep equilibrium alignments
Depth updates around local neighborhood
Feature-metric pose alignments
DEQ training
Experiments
Dataset and metrics
...and 14 more sections

Figures (8)

Figure 1: (a) The estimated pose of a camera affects the epipolar geometry. (b) The epipolar line in the source image, calculated from yellow points in the target image, for the PoseNet-based (Kendall et al., 2015) initial pose regression (red) and our refined pose (green). The yellow point in the source image is calculated based on our final depth and pose estimates.
Figure 2: (a) The overall pipeline of the model. Given a pair of source and target images, the teacher model predicts an initial depth $D_0$ and pose $T_0$, as well as initial hidden states that will be updated. DEQ-based alignments are then performed to find the fixed point and output the final predictions. (b) Each iteration of the update step takes the current depth and pose estimates. Matching costs are sampled along the current epipolar lines, which evolve with the pose estimates. The updates are computed by a Conv-GRU. Feature-metric alignment is then used to obtain a geometrically consistent pose update.
Figure 3: Qualitative results on KITTI data. $I_s$: input image; $W_q$, $W_{h,0}$, and $W_{h,5}$: confidence weights; $D_0$, $D_5$: disparity estimates. The Abs Rel error of the depth estimates is also shown.
Figure 4: Estimated trajectory by the initial pose estimator and the refined trajectory using our pose refinement module on (a) Seq. 09 and (b) Seq. 10 of KITTI odometry data. The refined pose estimate improves the global trajectory, even without explicitly training for global consistency.
Figure 5: The progression of Abs Rel errors in each DualRefine iteration for KITTI depth.
...and 3 more figures
