Video Depth without Video Models

Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, Konrad Schindler

TL;DR

The paper tackles the challenge of temporally coherent video depth estimation without resorting to heavy video diffusion models. It shows that a single-image latent diffusion model can be extended to handle short frame snippets via multi-frame self-attention, sampled at varying frame rates, and then globally aligned into a consistent depth video using a robust optimization over per-snippet scales and shifts. An optional diffusion-based refinement further sharpens details without altering the global depth layout. Across diverse datasets, RollingDepth achieves state-of-the-art or competitive performance for long videos, balancing per-frame accuracy with temporal stability and offering a scalable alternative to traditional video-based depth estimation pipelines.
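
The idea of sampling short snippets at varying frame rates can be illustrated with a minimal sketch. The function name `dilated_snippets`, the stride of one frame, and the dilation values `(1, 2, 4)` are assumptions made for illustration only; the summary above states just that snippets are short and sampled at several frame rates.

```python
def dilated_snippets(num_frames, snippet_len=3, dilations=(1, 2, 4)):
    """Return frame-index tuples for overlapping snippets at several dilation rates.

    Hypothetical helper: for each dilation g, a window of `snippet_len` frames
    spaced g frames apart is slid over the video with a stride of one frame.
    """
    snippets = []
    for g in dilations:
        span = (snippet_len - 1) * g  # distance between first and last frame
        for start in range(num_frames - span):
            snippets.append(tuple(start + k * g for k in range(snippet_len)))
    return snippets


if __name__ == "__main__":
    for snippet in dilated_snippets(6, dilations=(1, 2)):
        print(snippet)
```

Small dilations link neighbouring frames, while larger ones link frames that are further apart, so every frame ends up in several snippets that can later be registered against each other.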

Abstract

Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets; and (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various frame rates back into a consistent video. RollingDepth efficiently handles long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: rollingdepth.github.io.
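
The registration step in (ii) can be sketched as a joint fit of one scale and one shift per snippet, chosen so that all snippets covering the same frame agree. The sketch below is a simplification under stated assumptions: it uses plain linear least squares instead of a robust loss, it fixes the first snippet to scale 1 and shift 0 to remove the global ambiguity, and the function name `co_align` and its argument layout are hypothetical.

```python
import numpy as np


def co_align(snippets, frame_ids):
    """Jointly fit per-snippet (scale, shift) so that overlapping frames agree.

    snippets:  list of arrays of shape (T, H, W), one relative depth snippet each
    frame_ids: list of tuples with the video frame indices covered by each snippet
    Returns an array of shape (K, 2) holding (scale, shift) per snippet.
    Assumption: snippet 0 is the reference (scale=1, shift=0).
    """
    num_snippets = len(snippets)

    # Group the predictions by the video frame they describe.
    by_frame = {}
    for k, (depth, ids) in enumerate(zip(snippets, frame_ids)):
        for t, f in enumerate(ids):
            by_frame.setdefault(f, []).append((k, depth[t].ravel()))

    # One linear residual per pixel and per pair of snippets sharing a frame:
    # (s_a * d_a + t_a) - (s_b * d_b + t_b) = 0
    blocks = []
    for preds in by_frame.values():
        for (a, d_a), (b, d_b) in zip(preds[:-1], preds[1:]):
            block = np.zeros((d_a.size, 2 * num_snippets))
            block[:, 2 * a] = d_a
            block[:, 2 * a + 1] = 1.0
            block[:, 2 * b] = -d_b
            block[:, 2 * b + 1] = -1.0
            blocks.append(block)
    A = np.vstack(blocks)

    # Fix the reference snippet: move its known terms (s=1, t=0) to the right-hand side.
    rhs = -A[:, 0]
    solution, *_ = np.linalg.lstsq(A[:, 2:], rhs, rcond=None)
    return np.concatenate([[1.0, 0.0], solution]).reshape(num_snippets, 2)
```

Applying `scale * snippet + shift` to each snippet and averaging the predictions per frame then yields one depth map per frame. A robust variant would replace the squared residual with a loss that is less sensitive to outliers, which generally requires an iterative solver rather than a single `lstsq` call.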

Paper Structure

This paper contains 27 sections, 6 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The RollingDepth model takes an unconstrained video and reconstructs a corresponding depth video. Unlike methods that rely on video diffusion models, it extends a single-image monodepth estimator such that it can process short snippets. To account for temporal context, snippets with varying frame rates are sampled from the video, processed, and reassembled through a global alignment algorithm to obtain long, temporally coherent depth videos. Depth is colour-coded from near to far.
  • Figure 2: Overview of the RollingDepth Inference Pipeline. Given a video sequence $\mathbf{x}$ with frames indexed by $i$, we construct $N_T$ overlapping snippets using a dilated rolling kernel with varying dilation rates, and perform 1-step inference to obtain initial depth snippets. Next, depth co-alignment optimizes $N_T$ pairs of scale and shift values to achieve globally consistent depth throughout the full video. An optional refinement step further enhances details by applying additional, snippet-based denoising steps.
  • Figure 3: Depth Refinement encodes the co-aligned depth video into latent space, contaminates it with a moderate amount of noise, then denoises it with a series of reverse diffusion steps with decreasing snippet dilation rate. After each step, overlapping latents are averaged to propagate information between snippets.
  • Figure 4: Qualitative comparison between different methods. RollingDepth excels at preserving fine-grained details (cf. the chandelier in the first sample and the tripod in the third sample) and recovering accurate scene layout (cf. the far plane in the second sample).
  • Figure 5: AbsRel error over time: The line plot (left) shows the depth error at every individual frame; end-of-line numbers are the average error across the video. The images (right) display error maps (from low to high) for two specific frames. RollingDepth achieves the lowest error overall; competing methods recover scene layout less faithfully and tend to be biased towards the foreground or the background.
  • ...and 4 more figures
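
The per-frame AbsRel curve described in Figure 5 can be computed along the following lines. This is a minimal sketch using the standard definition of absolute relative error, mean of |prediction - ground truth| / ground truth over valid pixels; the function name `absrel_per_frame` is hypothetical, and the predictions are assumed to be already aligned to the ground-truth scale.

```python
import numpy as np


def absrel_per_frame(pred, gt, valid=None):
    """Absolute relative error per frame: mean(|pred - gt| / gt) over valid pixels.

    pred, gt: arrays of shape (T, H, W); gt must be positive where valid.
    valid:    optional boolean mask of the same shape (default: gt > 0).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if valid is None:
        valid = gt > 0
    valid = np.asarray(valid, dtype=bool)

    # Divide only where the ground truth is valid; other pixels stay at zero.
    err = np.zeros_like(gt)
    np.divide(np.abs(pred - gt), gt, out=err, where=valid)

    num_frames = gt.shape[0]
    counts = valid.reshape(num_frames, -1).sum(axis=1)
    return err.reshape(num_frames, -1).sum(axis=1) / np.maximum(counts, 1)
```

Plotting the returned vector against the frame index gives the kind of line plot the caption describes, and its mean corresponds to the average error across the video.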