PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency

Leezy Han, Seunggyu Kim, Dongseok Shim, Hyeonbeom Lee

Abstract

Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate the camera pose from optical flow between consecutive frames and recover sparse metric depth by triangulation. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, and MS2 datasets, as well as our own dataset, demonstrating robust and accurate depth estimation performance.
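
To make the recursive Bayesian scale update described above concrete, the following is a minimal sketch assuming a 1D Gaussian (Kalman-style) model over the metric scale. The class name, noise parameters, and median-ratio observation model are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a recursive Bayesian scale update, assuming a
# 1D Gaussian (Kalman-style) model. All names and noise values here are
# illustrative assumptions, not the paper's implementation.
import numpy as np

class RecursiveScaleEstimator:
    """Tracks a scalar metric scale s with a Gaussian posterior N(mu, var)."""

    def __init__(self, mu0=1.0, var0=1e2, process_var=1e-3, obs_var=1e-1):
        self.mu, self.var = mu0, var0    # current posterior over the scale
        self.process_var = process_var   # drift allowed between frames (assumed)
        self.obs_var = obs_var           # noise of one scale observation (assumed)

    def update(self, sparse_metric_depth, relative_depth):
        """Fuse one frame's scale observation into the running posterior.

        sparse_metric_depth: (N,) triangulated metric depths at sparse pixels
        relative_depth:      (N,) foundation-model depth sampled at the same pixels
        """
        # Per-frame scale observation: the median ratio is robust to flow outliers.
        s_obs = np.median(sparse_metric_depth / relative_depth)

        # Predict: let the scale drift slightly between frames.
        var_pred = self.var + self.process_var

        # Correct: standard 1D Gaussian (Kalman) fusion.
        k = var_pred / (var_pred + self.obs_var)
        self.mu = self.mu + k * (s_obs - self.mu)
        self.var = (1.0 - k) * var_pred
        return self.mu

def rescale(relative_depth_map, scale):
    """Convert a relative depth map to metric depth with the current scale."""
    return scale * relative_depth_map
```

Because the posterior variance shrinks as observations accumulate, the estimated scale stabilizes over time, which is what gives the rescaled depth its temporal consistency in this model.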

Paper Structure

This paper contains 43 sections, 25 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: The overall framework of the proposed method. Camera pose and metric scale are estimated from optical flow and wheel odometry, respectively, and used to fuse triangulated metric depth with relative depth from a foundation model.
  • Figure 2: Triangulation and temporal depth propagation. The posterior depth $z^{\mathrm{post}}_{i-1}$ from frame $i{-}1$ is lifted to the 3D point $\mathbf{p}_i$, transformed by the estimated pose $(\boldsymbol{\Omega},\boldsymbol{T})$, and projected into frame $i$ to form the temporal prior $z^{\mathrm{prior}}_i$. Independently, triangulation from optical flow provides a new metric observation. Bayesian fusion combines the temporal prior with this observation to produce the refined posterior $z^{\mathrm{post}}_i$, which is recursively propagated to the next frame as $z^{\mathrm{prior}}_{i+1}$ (a code sketch of this step follows the figure list).
  • Figure 3: Depth estimation on various datasets. We evaluate our method on both real-world RGB datasets, such as KITTI, and synthetic RGB datasets, such as TartanAir. We also perform depth estimation on thermal and NIR imagery using the MS2 dataset as well as our newly collected dataset.
  • Figure 4: Comparison of 3D reconstruction results. Compared to conventional depth estimation, our approach produces reliable depth closely aligned with the ground truth. Furthermore, when multiple frames are accumulated, the reconstructed point clouds exhibit sufficient geometric consistency.
  • Figure 5: Comparison of accumulated point clouds. When multiple point clouds are accumulated, our algorithm consistently produces temporally coherent depth estimates.
  • ...and 6 more figures
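
The propagation step in Figure 2 can be written compactly. Below is a minimal sketch assuming a pinhole camera model and a per-pixel Gaussian depth representation; the function names, the variance inflation constant, and the nearest-pixel splatting are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the Figure 2 pipeline: lift, transform, project, fuse.
# Assumes a pinhole camera and per-pixel Gaussian depth; names and variances
# are illustrative assumptions, not the authors' code.
import numpy as np

def propagate_depth(z_post_prev, var_prev, K, R, t):
    """Warp frame i-1's posterior depth into frame i as a temporal prior.

    z_post_prev: (H, W) posterior depth of frame i-1
    var_prev:    (H, W) its per-pixel variance
    K:           (3, 3) camera intrinsics
    R, t:        pose of frame i w.r.t. frame i-1 (rotation, translation)
    """
    H, W = z_post_prev.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, HW)

    # Lift each pixel to the 3D point p = z * K^{-1} [u, v, 1]^T.
    p = np.linalg.inv(K) @ pix * z_post_prev.reshape(1, -1)

    # Transform into frame i and project back onto the image plane.
    p_i = R @ p + t[:, None]
    proj = K @ p_i
    u_i = np.round(proj[0] / proj[2]).astype(int)
    v_i = np.round(proj[1] / proj[2]).astype(int)

    # Splat to the nearest pixel; collisions resolve by last write (sketch-level).
    z_prior = np.full((H, W), np.nan)
    var_prior = np.full((H, W), np.inf)
    valid = (u_i >= 0) & (u_i < W) & (v_i >= 0) & (v_i < H) & (proj[2] > 0)
    z_prior[v_i[valid], u_i[valid]] = p_i[2, valid]
    # Inflate variance to account for pose and warping error (assumed model).
    var_prior[v_i[valid], u_i[valid]] = var_prev.reshape(-1)[valid] + 1e-2
    return z_prior, var_prior

def fuse(z_prior, var_prior, z_obs, var_obs):
    """Gaussian fusion of the temporal prior with a triangulated observation."""
    # Where no prior exists (NaN / infinite variance), keep the observation.
    no_prior = ~np.isfinite(z_prior)
    w = var_obs / (var_prior + var_obs)
    z_post = np.where(no_prior, z_obs, w * z_prior + (1.0 - w) * z_obs)
    var_post = np.where(no_prior, var_obs,
                        var_prior * var_obs / (var_prior + var_obs))
    return z_post, var_post
```

The recursion closes by feeding `z_post` and `var_post` back into `propagate_depth` at the next frame, matching the $z^{\mathrm{post}}_i \rightarrow z^{\mathrm{prior}}_{i+1}$ loop in the Figure 2 caption.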