MEDeA: Multi-view Efficient Depth Adjustment

Mikhail Artemyev, Anna Vorontsova, Anna Sokolova, Alexander Limonov

TL;DR

MEDeA tackles the problem of temporal inconsistency in depth estimation from video by introducing a fast test-time depth adjustment framework. It combines a pre-trained depth predictor with a lightweight depth deformation model and reprojection-based losses, organized into a two-stage optimization that enforces cross-frame coherence without relying on optical flow, normals, or segmentation networks. The key innovations are the depth scale propagation strategy and a hierarchical frame-pair sampling scheme, which together yield temporally consistent depth maps with an order-of-magnitude speedup over prior test-time approaches and state-of-the-art accuracy on TUM RGB-D, 7Scenes, and ScanNet, as well as robustness on ARKitScenes. This speedup makes test-time video depth adjustment considerably more practical for real-world applications and data captured on consumer devices.

Abstract

The majority of modern single-view depth estimation methods predict relative depth and thus cannot be directly applied in many real-world scenarios, despite impressive performance on benchmarks. Moreover, single-view approaches cannot guarantee consistency across a sequence of frames. Consistency is typically addressed with test-time optimization of discrepancy across views; however, this takes hours to process a single scene. In this paper, we present MEDeA, an efficient multi-view test-time depth adjustment method that is an order of magnitude faster than existing test-time approaches. Given RGB frames with camera parameters, MEDeA predicts initial depth maps, adjusts them by optimizing local scaling coefficients, and outputs temporally consistent depth maps. In contrast to test-time methods that require normals, optical flow, or semantics estimation, MEDeA produces high-quality predictions using only a depth estimation network. Our method sets a new state of the art on the TUM RGB-D, 7Scenes, and ScanNet benchmarks and successfully handles smartphone-captured data from the ARKitScenes dataset.

Paper Structure (25 sections, 8 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Comparison of depth estimation errors and runtime on the TUM RGB-D dataset dehghan2021arkitscenes. The proposed MEDeA surpasses existing test-time optimization methods (CVD luo2020cvd, RobustCVD kopf2021rcvd, and GCVD lee2022gcvd) in both accuracy and speed. Our flagship MEDeA-M model outperforms competitors by a large margin, and even our fast MEDeA-S delivers higher quality than existing approaches with a 25x speedup. Seconds per frame are shown on a logarithmic scale for visibility.
  • Figure 2: Overview of MEDeA. MEDeA relies on a depth estimation model that outputs either metric or up-to-scale depth maps $D^0$. In our depth deformation model, a depth map $D$ is estimated as an initial depth map $D^0$ multiplied element-wise by a depth scale map $S$: $D = D^0 \odot S$. At each iteration, a pair of RGB-D frames with indices $(i, j)$ is selected, and $S_i,\ S_j$ are adjusted. Using the current estimate of $D_i$, MEDeA reprojects an image $I_j$, a depth map $D_j$, and a feature map $F_j$ onto the $i$-th viewpoint, and penalizes the divergence between the reprojected $\{ I_{j\rightarrow i}, D_{j\rightarrow i}, F_{j\rightarrow i}\}$ and the original $\{ I_i, D_i, F_i \}$ values. Depth scale maps $S_i,\ S_j$ are optimized via backpropagation. As a result, MEDeA provides consistent depth maps $D_i,\ D_j$.
  • Figure 3: ARKitScenes (top) and TUM RGB-D (bottom) scenes, reconstructed using depth maps produced by different test-time optimization methods, including MEDeA. For a fair comparison, we do not use ground truth camera poses but estimate them with DROID-SLAM teed2021droidslam. The other methods struggle to recover the overall scene structure, while MEDeA provides well-aligned scans with fewer visual artifacts.
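
The depth deformation and reprojection described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`apply_scale`, `backproject`, `warp_j_to_i`, `depth_residual`), the nearest-neighbour sampling, and the choice to compare the sampled $D_j$ against the z-coordinate of the transformed point are all assumptions made for illustration. A real optimizer would presumably use differentiable (e.g. bilinear) sampling so that $S_i,\ S_j$ can be updated via backpropagation, and would add the image and feature terms.

```python
import numpy as np

def apply_scale(d0, s):
    """Depth deformation from the caption: D = D0 ⊙ S (element-wise)."""
    return d0 * s

def backproject(depth, K):
    """Lift every pixel to a 3D point in its own camera frame; returns (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # rays with unit z
    return rays * depth[..., None]

def warp_j_to_i(d_i, d_j, K, T_ji):
    """Sample D_j where the pixels of view i land in view j.

    T_ji is a hypothetical 4x4 matrix mapping camera-i coordinates to
    camera-j coordinates. Returns (sampled D_j, point depth in j, valid mask).
    """
    h, w = d_i.shape
    pts_i = backproject(d_i, K).reshape(-1, 3)
    pts_j = pts_i @ T_ji[:3, :3].T + T_ji[:3, 3]
    z = pts_j[:, 2]
    safe_z = np.where(z > 1e-6, z, 1.0)      # avoid division by zero
    proj = pts_j @ K.T
    u = np.rint(proj[:, 0] / safe_z).astype(int)  # nearest-neighbour (assumption)
    v = np.rint(proj[:, 1] / safe_z).astype(int)
    valid = (z > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    sampled = np.zeros(h * w)
    sampled[valid] = d_j[v[valid], u[valid]]
    return sampled.reshape(h, w), z.reshape(h, w), valid.reshape(h, w)

def depth_residual(d_i, d_j, K, T_ji):
    """Mean absolute depth disagreement over pixels visible in both views."""
    sampled, z, valid = warp_j_to_i(d_i, d_j, K, T_ji)
    return np.abs(sampled[valid] - z[valid]).mean()
```

In this sketch, a residual close to zero means the two scaled depth maps agree on the geometry seen from both viewpoints; minimizing such a residual with respect to the scale maps is the role the caption assigns to the reprojection losses.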