Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption
Johann-Friedrich Feiden, Tim Küchler, Denis Zavadski, Bogdan Savchynskyy, Carsten Rother
TL;DR
This work tackles online monocular video depth estimation by converting the offline Video Depth Anything (VDA) model into an online variant, oVDA. It borrows two techniques from LLMs: caching latent features at inference to form a temporal context, and masking attention during training. A Scale-and-Shift Consistency Loss (SaSCon) additionally enforces temporal coherence. Among online methods, the approach achieves state-of-the-art accuracy and the lowest VRAM usage, running at 42 FPS on an NVIDIA A100 and 20 FPS on an NVIDIA Jetson edge device, with strong temporal stability across diverse datasets. The release of code and edge compilation tooling supports practical deployment and real-time depth estimation on low-power hardware. Limitations include scale drift on very long sequences and trailing artefacts behind fast-moving objects, both of which call for further work before use in safety-critical applications.
Abstract
Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely caching latent features during inference and masking frames during training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
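
To make the two LLM-style ingredients concrete, the following is a minimal sketch, not the oVDA implementation. It assumes, purely for illustration, that each frame is summarised by a single latent vector and that temporal mixing is plain scaled dot-product attention. The names `causal_frame_mask`, `attend`, and `LatentCache`, as well as the cache length, are hypothetical and do not come from the paper.

```python
from collections import deque

import numpy as np


def causal_frame_mask(num_frames: int) -> np.ndarray:
    """Training-time mask: entry [t, s] is True when frame t may attend to frame s (s <= t)."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))


def attend(query: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention for one query vector over n context vectors."""
    scores = keys @ query / np.sqrt(query.shape[-1])  # shape (n,)
    weights = np.exp(scores - scores.max())           # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                           # shape (D,)


class LatentCache:
    """Inference-time rolling cache of per-frame latents (hypothetical helper)."""

    def __init__(self, max_frames: int):
        # A bounded deque drops the oldest latent automatically,
        # so memory stays constant regardless of video length.
        self.frames = deque(maxlen=max_frames)

    def step(self, latent: np.ndarray) -> np.ndarray:
        """Add the current frame's latent and fuse it with the cached past frames."""
        self.frames.append(latent)
        context = np.stack(self.frames)  # shape (n, D), past frames plus current
        return attend(latent, context, context)


# Usage: feed frames one at a time, as in an online setting.
cache = LatentCache(max_frames=2)
for _ in range(3):
    fused = cache.step(np.ones(4))
```

The lower-triangular mask mimics at training time what the cache enforces at inference: a frame never sees the future. The bounded cache is what keeps memory use independent of sequence length in this sketch.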
