AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory

Lianjie Ma, Yuquan Li, Bingzheng Jiang, Ziming Zhong, Han Ding, Lijun Zhu

TL;DR

Validated across indoor static, indoor dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades gracefully between refreshes and achieves 161 FPS on a Jetson AGX Orin with TensorRT, demonstrating its feasibility for real-time edge deployment.

Abstract

Foundation-model-based monocular depth estimation offers a viable alternative to active sensors for robot perception, yet its computational cost often prohibits deployment on edge platforms. Existing methods run inference independently on every frame and leave unexploited the substantial redundancy between adjacent viewpoints in continuous robot operation. This paper presents AsyncMDE, an asynchronous depth perception system that pairs a foundation model with a lightweight model to amortize the foundation model's computational cost over time. The foundation model produces high-quality spatial features in the background, while the lightweight model runs asynchronously in the foreground: it fuses cached memory with current observations through complementary fusion, outputs depth estimates, and autoregressively updates the memory. This enables cross-frame feature reuse with bounded accuracy degradation. With only 3.83M parameters, AsyncMDE runs at 237 FPS on an RTX 4090 and recovers 77% of the accuracy gap between a lightweight baseline and the foundation model while using 25× fewer parameters. Validated across indoor static, indoor dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades gracefully between refreshes and achieves 161 FPS on a Jetson AGX Orin with TensorRT, demonstrating its feasibility for real-time edge deployment.
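
The slow-path/fast-path interplay described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: `slow_path` and `fast_encoder` are hypothetical stand-ins for the foundation and lightweight models, "complementary fusion" is assumed here to be a convex blend with a fixed gate, the refresh happens synchronously every `REFRESH_INTERVAL` frames rather than in a background thread, and all names are invented for illustration.

```python
import numpy as np

REFRESH_INTERVAL = 10  # hypothetical: frames between slow-path refreshes


def slow_path(frame):
    """Hypothetical stand-in for the foundation model (high-quality features)."""
    return frame.astype(np.float64)


def fast_encoder(frame, rng):
    """Hypothetical stand-in for the lightweight encoder (noisier features)."""
    return frame + rng.normal(0.0, 0.1, size=frame.shape)


def complementary_fuse(memory, current, gate):
    """Assumed fusion rule: convex blend, gate weights memory, 1 - gate weights current."""
    return gate * memory + (1.0 - gate) * current


def run(frames, gate=0.7, seed=0):
    """Process frames in order; return fused features and the lag of each frame."""
    rng = np.random.default_rng(seed)
    memory = None
    lag = 0
    outputs, lags = [], []
    for t, frame in enumerate(frames):
        if t % REFRESH_INTERVAL == 0:
            # Slow path: overwrite the cached memory with fresh features.
            memory = slow_path(frame)
            lag = 0
        else:
            lag += 1
        # Fast path: fuse cached memory with the current observation.
        current = fast_encoder(frame, rng)
        fused = complementary_fuse(memory, current, gate)
        outputs.append(fused)
        # Autoregressive update: the fused features become the new memory.
        memory = fused
        lags.append(lag)
    return outputs, lags
```

In this toy version the lag counter resets whenever the memory is refreshed, so it never exceeds `REFRESH_INTERVAL - 1`; a real deployment would instead run the slow path concurrently and write to memory whenever a result becomes available.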

Paper Structure

This paper contains 28 sections, 14 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of AsyncMDE. Top: the Slow Path (DAv2-ViTB) periodically refreshes spatial memory; the Fast Path fuses cached memory with each frame at high frequency, with depth maps at increasing lag showing graceful degradation. Bottom: efficiency–accuracy trade-off (three-benchmark average $\delta_1$); AsyncMDE (3.83M parameters, 237 FPS) recovers 77% of the $\delta_1$ gap between the lightweight baseline and the foundation model.
  • Figure 2: AsyncMDE system overview. DAv2-ViTB runs asynchronously in the background (slow path, $\sim$60 Hz), writing results to spatial memory when available; the lightweight network continuously predicts depth for the current viewpoint (fast path, $\sim$240 Hz), combining cached memory with current observations through complementary fusion and autoregressively updating memory. During training, DAv2 also provides pseudo-label depth for supervision.
  • Figure 3: Lag–accuracy degradation curves. The evaluation interval $N = 20$ exceeds the training setting ($N = 10$) to test out-of-distribution generalization. ScanNet and Bonn degrade gracefully within the training interval (lag $\leq 10$) and more steeply beyond; Sintel AbsRel saturates beyond lag $= 10$ at $\sim$0.34, exhibiting bounded degradation.
  • Figure 4: Qualitative depth comparison (least-squares aligned). The three rows correspond to ScanNet (indoor static), Bonn (indoor dynamic), and Sintel (synthetic extreme). AsyncMDE produces depth quality comparable to DAv2-ViTB at low lag.
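
The captions above rely on $\delta_1$, AbsRel, least-squares alignment, and a "gap recovered" percentage. The sketch below shows the standard definitions of these quantities as commonly used in depth-estimation evaluation. It is an illustration, not the paper's evaluation code: the function names are hypothetical, and alignment is assumed to be a scale-and-shift fit applied directly to the arrays passed in (the captions do not say whether the fit is done in depth or disparity space).

```python
import numpy as np


def align_least_squares(pred, gt):
    """Fit scale s and shift b minimizing ||s * pred + b - gt||^2, then apply them."""
    p = pred.ravel()
    g = gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + b


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    return float(np.mean(np.abs(pred - gt) / gt))


def delta1(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < thresh. Higher is better."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))


def gap_recovered(light, ours, foundation):
    """Share of the baseline-to-foundation metric gap closed by `ours`."""
    return (ours - light) / (foundation - light)
```

Both metrics assume strictly positive ground-truth and predicted depths; in practice invalid pixels are masked out before calling them.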