Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption

Johann-Friedrich Feiden, Tim Küchler, Denis Zavadski, Bogdan Savchynskyy, Carsten Rother

TL;DR

This work tackles online monocular video depth estimation by converting the offline Video Depth Anything (VDA) model into an online variant, oVDA. It introduces two LLM-inspired techniques: caching latent features at inference to form a temporal context, and applying masked attention during training. A Scale-and-Shift Consistency Loss (SaSCon) additionally enforces temporal coherence. The approach achieves state-of-the-art accuracy and low VRAM usage among online methods, running at 42 FPS on an NVIDIA A100 and 20 FPS on a Jetson edge device, with strong temporal stability across diverse datasets. The authors plan to release code and edge-compilation tooling to ease real-time deployment on low-power hardware. Limitations include scale drift on very long sequences and trailing artefacts with fast-moving objects, pointing to future work for safety-critical applications.

Abstract

Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely caching latent features during inference and masking frames during training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
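
The two LLM-style ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention here has no learned projections, each frame is reduced to a single feature vector, and the names `causal_self_attention` and `LatentCache` are hypothetical. The sketch only shows why training with a causal mask over a batch of frames matches inference with a cache of past latent features, as long as the cache window covers all past frames.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(feats):
    """Training-style pass: all T frames at once, frame t sees frames <= t only."""
    T, d = feats.shape
    scores = feats @ feats.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ feats


class LatentCache:
    """Inference-style pass: sliding window over past latent features (assumed design)."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.buf = deque(maxlen=window)

    def step(self, feat):
        self.buf.append(feat)                 # update the cache with the current frame
        ctx = np.stack(self.buf)              # (n_cached, d)
        scores = ctx @ feat / np.sqrt(feat.shape[0])
        return softmax(scores) @ ctx          # current frame attends to the cache


rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))              # 16 frames, 8-dim toy latents

offline = causal_self_attention(feats)
cache = LatentCache(window=16)
online = np.stack([cache.step(f) for f in feats])

small_cache = LatentCache(window=4)
for f in feats:
    small_cache.step(f)
```

With a window of 16 the frame-by-frame outputs in `online` agree with the masked batch outputs in `offline`. With a shorter window, such as `small_cache`, memory stays constant but frames older than the window no longer contribute.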

Paper Structure

This paper contains 24 sections, 3 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: (Left) Our online Video Depth Anything approach (oVDA) produces high-quality depth predictions at up to 20 FPS on an edge device (NVIDIA Jetson Orin NX) by caching a sliding window of past latent features (red box). (Right) Performance comparison of various online video depth estimation methods on the KITTI dataset. Our oVDA approach outperforms all competitors in both AbsRel error and VRAM usage. Note that low VRAM usage is crucial for many real-world applications, such as deployment on edge devices. The size of each circle represents the number of parameters.
  • Figure 2: (Left) Our oVDA architecture operating on the current frame $t=15$. In contrast to offline VDA, which runs on batches of frames (orange box), our online oVDA approach processes only the current frame (red box). The blue blocks process only spatial dimensions, while the Motion Modules (green) perform temporal reasoning and differ from those of VDA. During training, only the Spatiotemporal Head is fine-tuned, while the DINOv2 backbone is kept frozen. (Middle and Right) Illustration of the inference and training procedures of our new Motion Module with index $j$. First, the hidden features are re-ordered to perform temporal reasoning. During inference, the cache is updated by adding the current latent feature ($L_{j,t=15}^\prime$) and removing the latent feature of the oldest cached frame. Afterwards, cross-attention is applied between the current latent feature and the cached latent features. At training time, similar to VDA, we process a batch of frames (here $t=0,\dots,15$) but apply masked self-attention, ensuring that each frame can only attend to past frames.
  • Figure 3: Visual comparison for an in-the-wild video from DAVIS. We show four frames, equally spaced in time throughout the entire video sequence. The last row gives a stitched image of vertical slices (indicated by the red column in each RGB image, where the column is 24 pixels wide). The rightmost image is again the stitched version of the predicted depths, used to inspect temporal consistency. Note that both ChronoDepth and NVDS are run in their online setting. We see that oVDA produces the most temporally consistent results, followed by CUT3R, in contrast to e.g. FlashDepth-s (black box). However, the predictions of CUT3R are blurrier than ours. In summary, our approach is visually best in terms of temporal consistency and on par with FlashDepth in terms of details.
  • Figure 4: Scale drift over time for KITTI. We plot the absolute relative difference between the optimal scale for the first frame and the optimal scale for each subsequent frame. The grey histogram gives the data support, i.e. the number of data points used to calculate the scale error, which decreases for later frames due to varying sequence lengths. We smoothed the scale error with a window size of 4 for better visualisation. Our oVDA method has the lowest scale drift.
  • Figure 5: Visualisation of latent features using PCA. Raw encoder features ($F_1$, $F_2$) are temporally inconsistent, while features after one and two Motion Modules (MM.) remain stable across time, indicating that the Motion Modules enforce temporal consistency.
  • ...and 4 more figures
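
The scale-drift measure described in the Figure 4 caption can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: it assumes the "optimal scale" of a frame is the scale of a least-squares scale-and-shift fit from prediction to ground truth, and that the smoothing is a plain moving average. The function names `optimal_scale`, `scale_drift` and `smooth` are hypothetical.

```python
import numpy as np


def optimal_scale(pred, gt, valid):
    """Scale s of the least-squares fit gt ~ s * pred + t over valid pixels (assumed definition)."""
    p = pred[valid]
    g = gt[valid]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s


def scale_drift(preds, gts, valids):
    """Absolute relative difference between each frame's scale and the first frame's scale."""
    scales = np.array([optimal_scale(p, g, v) for p, g, v in zip(preds, gts, valids)])
    return np.abs(scales - scales[0]) / np.abs(scales[0])


def smooth(x, window=4):
    """Moving average over `window` frames (assumed smoothing; shortens the series)."""
    return np.convolve(x, np.ones(window) / window, mode="valid")


# Synthetic check: predictions whose true scale grows by 0.1 per frame.
rng = np.random.default_rng(0)
T, H, W = 8, 4, 5
gts = rng.uniform(1.0, 50.0, size=(T, H, W))
true_scales = 1.0 + 0.1 * np.arange(T)
preds = (gts - 0.5) / true_scales[:, None, None]   # so gt = s_t * pred + 0.5
valids = np.ones((T, H, W), dtype=bool)

drift = scale_drift(preds, gts, valids)
drift_smoothed = smooth(drift, window=4)
```

In this synthetic setting the recovered drift grows linearly with the frame index, which is the kind of curve the figure reports per method. Averaging such curves over sequences of different lengths would explain why the data support shrinks for later frames.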