MEDeA: Multi-view Efficient Depth Adjustment
Mikhail Artemyev, Anna Vorontsova, Anna Sokolova, Alexander Limonov
TL;DR
MEDeA addresses temporal inconsistency in depth estimated from video with a fast test-time depth adjustment framework. It combines a pre-trained depth predictor with a lightweight depth deformation model and reprojection-based losses, organized into a two-stage optimization that enforces cross-frame coherence without relying on optical flow, normal, or segmentation networks. Its key components are a depth scale propagation strategy and a hierarchical frame-pair sampling scheme. Together they yield temporally consistent depth maps about an order of magnitude faster than prior test-time approaches, with state-of-the-art accuracy on TUM RGB-D, 7Scenes, and ScanNet and robust results on smartphone-captured ARKitScenes data. This makes test-time depth adjustment considerably more practical for real-world video, including footage from consumer devices.
Abstract
Most modern single-view depth estimation methods predict relative depth and thus cannot be directly applied in many real-world scenarios, despite impressive benchmark performance. Moreover, single-view approaches cannot guarantee consistency across a sequence of frames. Consistency is typically addressed by test-time optimization of the discrepancy across views; however, this takes hours for a single scene. In this paper, we present MEDeA, an efficient multi-view test-time depth adjustment method that is an order of magnitude faster than existing test-time approaches. Given RGB frames with camera parameters, MEDeA predicts initial depth maps, adjusts them by optimizing local scaling coefficients, and outputs temporally consistent depth maps. Unlike test-time methods that require normal, optical flow, or semantics estimation, MEDeA produces high-quality predictions using only a depth estimation network. Our method sets a new state of the art on the TUM RGB-D, 7Scenes, and ScanNet benchmarks and successfully handles smartphone-captured data from the ARKitScenes dataset.
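To make the idea of adjusting depth with local scaling coefficients and checking cross-view discrepancy more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the nearest-neighbour upsampling of the coarse scale grid, the camera-to-world pose convention, and the plain absolute-difference residual are all assumptions made for illustration. The sketch only evaluates the kind of quantity a test-time optimization could reduce; it contains no optimizer.

```python
import numpy as np

def apply_local_scales(depth, scales):
    """Multiply a depth map by a coarse grid of scale coefficients.

    Assumption: the coarse grid is upsampled by nearest-neighbour repetition.
    """
    h, w = depth.shape
    gh, gw = scales.shape
    assert h % gh == 0 and w % gw == 0, "grid must tile the image evenly"
    full = np.kron(scales, np.ones((h // gh, w // gw)))
    return depth * full

def reproject_depth(depth_i, K, T_i, T_j):
    """Lift frame-i pixels to 3D and project them into frame j.

    Assumption: T_i and T_j are 4x4 camera-to-world poses and K has the
    usual last row [0, 0, 1]. Returns pixel coordinates in frame j with
    shape (h, w, 2) and the depth of each point as seen from frame j.
    """
    h, w = depth_i.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T                # back-projected rays with z = 1
    pts_i = rays * depth_i.reshape(-1, 1)          # 3D points in camera i
    pts_i_h = np.concatenate([pts_i, np.ones((pts_i.shape[0], 1))], axis=1)
    T_ij = np.linalg.inv(T_j) @ T_i                # camera i -> camera j
    pts_j = (pts_i_h @ T_ij.T)[:, :3]
    z = pts_j[:, 2]
    z_safe = np.where(z > 0, z, 1.0)               # avoid division by zero
    uv = (pts_j @ K.T)[:, :2] / z_safe[:, None]
    return uv.reshape(h, w, 2), z.reshape(h, w)

def consistency_residual(depth_i, depth_j, K, T_i, T_j):
    """Mean absolute difference between reprojected and observed depth in frame j."""
    uv, z = reproject_depth(depth_i, K, T_i, T_j)
    h, w = depth_j.shape
    u = np.rint(uv[..., 0]).astype(int)
    v = np.rint(uv[..., 1]).astype(int)
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    if not valid.any():
        return 0.0
    return float(np.abs(z[valid] - depth_j[v[valid], u[valid]]).mean())
```

In this toy setting, a depth map whose scale is wrong produces a large residual against a neighbouring frame, and multiplying it by suitable scale coefficients drives the residual toward zero. A real system would optimize the coefficients over many frame pairs and would need to handle occlusions, which this sketch ignores.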
