StereoDiff: Stereo-Diffusion Synergy for Video Depth Estimation

Haodong Li, Chen Wang, Jiahui Lei, Kostas Daniilidis, Lingjie Liu

TL;DR

StereoDiff addresses the challenge of video depth estimation by separating global and local consistency: static backgrounds are stabilized via a stereo-matching stage that yields strong global depth cues, while dynamic regions gain temporal smoothness through a one-shot video depth diffusion stage that denoises high-frequency fluctuations. The method is fully inference-based and training-free, leveraging MonST3R for robust stereo correspondences and DepthCrafter for diffusion priors, with a frequency-domain justification showing preservation of low-frequency global content and attenuation of high-frequency local noise. Empirically, StereoDiff achieves SoTA performance on four zero-shot benchmarks (Bonn, KITTI, ScanNetV2, Sintel), delivering improved temporal stability and cross-frame coherence with about 2.1x faster inference than prior diffusion-based approaches. The work highlights a principled synergy between geometry-based global cues and data-driven priors, enabling reliable video depth estimation across indoor and outdoor scenes while maintaining efficiency.

Abstract

Recent video depth estimation methods achieve strong performance by following the paradigm of image depth estimation, i.e., typically fine-tuning pre-trained video diffusion models on massive data. However, we argue that video depth estimation is not a naive extension of image depth estimation. The temporal consistency requirements for dynamic and static regions in videos are fundamentally different. Consistent video depth in static regions, typically backgrounds, can be achieved more effectively via stereo matching across all frames, which provides much stronger global 3D cues. In contrast, consistency in dynamic regions should still be learned from large-scale video depth data to ensure smooth transitions, because these regions violate triangulation constraints. Based on these insights, we introduce StereoDiff, a two-stage video depth estimator that combines stereo matching, mainly for static areas, with video depth diffusion, which maintains consistent depth transitions in dynamic areas. Through frequency-domain analysis, we mathematically demonstrate that stereo matching and video depth diffusion offer complementary strengths, highlighting how their synergy captures the advantages of both. Experimental results on zero-shot, real-world, dynamic video depth benchmarks, both indoor and outdoor, demonstrate StereoDiff's SoTA performance, showcasing its superior consistency and accuracy in video depth estimation.
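
As a rough illustration of the frequency-domain argument, the sketch below computes the magnitude spectrum of a per-frame error sequence and measures how much of its non-DC energy lies in the high-frequency band. Everything in it is an assumption made for illustration: the two synthetic error sequences (`drift`, `jitter`), the helper names, and the `cutoff_ratio` band boundary are hypothetical and are not taken from the paper or its data.

```python
import cmath
import math
import random


def magnitude_spectrum(seq):
    """Return DFT magnitudes for bins 0..n//2 of a real-valued sequence.

    A real signal has a symmetric spectrum, so half of the bins suffice.
    Uses a naive O(n^2) DFT to stay dependency-free.
    """
    n = len(seq)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(seq))
        mags.append(abs(acc) / n)
    return mags


def high_band_fraction(seq, cutoff_ratio=0.25):
    """Fraction of non-DC spectral energy above the cutoff bin.

    cutoff_ratio is a hypothetical band boundary, not a value from the paper.
    """
    energy = [m * m for m in magnitude_spectrum(seq)[1:]]  # drop the DC bin
    cutoff = int(cutoff_ratio * len(energy))
    total = sum(energy)
    return sum(energy[cutoff:]) / total if total > 0 else 0.0


num_frames = 128
# Synthetic slow drift: one sine cycle over the clip (low-frequency error).
drift = [0.1 * math.sin(2 * math.pi * t / num_frames) for t in range(num_frames)]
# Synthetic frame-to-frame jitter: seeded white noise (broadband error).
rng = random.Random(0)
jitter = [rng.gauss(0.0, 0.05) for _ in range(num_frames)]

print("drift  high-band fraction:", high_band_fraction(drift))
print("jitter high-band fraction:", high_band_fraction(jitter))
```

Under these assumptions, the slow drift keeps nearly all of its energy in the lowest bins, while the jitter places most of its energy above the cutoff. This mirrors the split between low-frequency global content and high-frequency local noise described above.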

Paper Structure

This paper contains 20 sections, 9 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: StereoDiff delivers strong global and local consistency for video depth estimation. For global consistency, StereoDiff produces accurate and stable depth maps on static backgrounds across consecutive windows, leveraging stereo matching to prevent the abrupt depth shifts often seen in DepthCrafter [hu2024depthcrafter], where depth values on static backgrounds can vary significantly between adjacent windows. For local consistency, StereoDiff yields much smoother, flicker-free depth values across consecutive frames, especially in dynamic regions. In contrast, MonST3R [zhang2024monst3r] suffers from frequent, pronounced flickering and jitter in these areas.
  • Figure 2: Pipeline of StereoDiff. ① In the first stage, all video frames are paired for stereo matching, which focuses primarily on static backgrounds to achieve strong global consistency. ② Starting from the stereo matching-based video depth of the first stage, the second stage applies a single-step video depth diffusion that significantly improves local consistency without sacrificing the original global consistency, yielding video depth estimates with both strong global consistency and smooth local consistency.
  • Figure 3: Magnitude spectrum of the error sequence on the Bonn [palazzolo2019refusion] dataset. The first scene of Bonn, "balloon", containing 438 frames, is used as an example here. Due to symmetry, only the second half of the frequency spectrum is shown.
  • Figure 4: Comparison of the mean disparity value $\overline{1/D_t}$ on the Bonn [palazzolo2019refusion] dataset for MonST3R [zhang2024monst3r], DepthCrafter [hu2024depthcrafter], and StereoDiff. All disparity maps are normalized to $[0, 1]$ on a per-scene basis before comparison. Incorporating ZeroSNR drags the mean value of StereoDiff's disparity maps closer to the GT, resulting in improved performance (see the ablation table).
  • Figure 5: Qualitative comparisons on the Bonn dataset among MonST3R, DepthCrafter, and StereoDiff. Four consecutive frames are sampled from a video depth sequence to form a complete comparison set. Please visit https://stereodiff.github.io/ for video comparisons.
  • ...and 4 more figures
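
The per-frame mean disparity compared in the Figure 4 caption can be sketched as follows. This is a minimal sketch, not the authors' evaluation code: it assumes plain min-max normalization over all frames of a scene, and the function name, the nested-list input layout, and the `eps` clamp are hypothetical choices made for illustration.

```python
def per_frame_mean_disparity(depth_frames, eps=1e-6):
    """Mean normalized disparity per frame for one scene.

    depth_frames: list of frames, each a list of rows of positive depth values.
    Normalization is min-max over the whole scene (an assumption; the paper
    only states that maps are normalized to [0, 1] per scene).
    """
    # Disparity is inverse depth; clamp depth to avoid division by zero.
    disparity = [
        [[1.0 / max(d, eps) for d in row] for row in frame]
        for frame in depth_frames
    ]

    # Scene-level range shared by every frame of the scene.
    flat = [v for frame in disparity for row in frame for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # guard against a constant-disparity scene

    means = []
    for frame in disparity:
        values = [(v - lo) / scale for row in frame for v in row]
        means.append(sum(values) / len(values))
    return means


# Toy scene: two 1x2 frames, with the second frame farther from the camera.
toy_scene = [[[1.0, 2.0]], [[2.0, 4.0]]]
toy_means = per_frame_mean_disparity(toy_scene)
print(toy_means)
```

Plotting such per-frame means over time for a prediction and for the ground truth gives curves of the kind the Figure 4 caption describes. A persistent offset between them would show up as a gap in the scene's mean disparity.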