DVD: Deterministic Video Depth Estimation with Generative Priors

Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen

Abstract

Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.

Paper Structure (22 sections, 10 equations, 20 figures, 10 tables)

Figures (20)

  • Figure 1: (Top) Comparisons on a $1500$-frame in-the-wild video highlight a fundamental paradigm trade-off: representative generative models (e.g., DepthCrafter \cite{hu2025depthcrafter}) suffer from geometric hallucination, while leading discriminative baselines (e.g., VDA \cite{chen2025video}) face semantic ambiguity. DVD resolves this dilemma, delivering consistent, high-fidelity geometry. (Bottom) DVD achieves superior performance on both short and long videos (averaged over KITTI \cite{Geiger2012CVPR}, ScanNet \cite{dai2017scannet}, and Bonn \cite{palazzolo2019bonn}), while unlocking the rich priors implicit in video foundation models with very little task-specific data, e.g., less than $1\%$ of VDA's training set.
  • Figure 2: Overview of DVD. (Top) A video DiT ($\mathcal{F}_\theta$) performs single-pass depth regression, modulated by a structural anchor ($\tau_0$). Latent manifold rectification (LMR) mitigates mean collapse via differential constraints. (Bottom) For long video depth estimation, overlapping windows ($\mathcal{W}_A, \mathcal{W}_B$) are seamlessly aligned using a closed-form least-squares solver, leveraging the model's global affine coherence.
  • Figure 3: Timestep as a structural anchor. Visualizations on NYU \cite{SilbermanECCV12nyuv2} demonstrate a fidelity-stability trade-off. A low anchor ($\tau=0.0$) recovers sharp boundaries but lacks global consistency, whereas a high anchor ($\tau=0.8$) causes detail loss (e.g., blur). An optimal anchor ($\tau=0.5$) balances these regimes, trading off detail recovery against metric accuracy. More detailed quantitative analyses are shown in Figure \ref{fig:ablation_timestep}.
  • Figure 4: Timestep embedding similarity. Cosine similarity matrix of timestep embeddings ($t \in [0, 1]$, stride $0.1$). While embeddings are broadly consistent, mid-range timesteps exhibit high similarity with a wider range of states.
  • Figure 5: LMR mitigates mean collapse. Naive regression (second row) exhibits mean collapse, losing high-frequency details. In contrast, our LMR (third row) enforces differential constraints to rectify the latent manifold, recovering both sharp spatial boundaries and temporal coherence. Quantitative analyses appear in Figure \ref{fig:ablation}.
  • ...and 15 more figures
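
The Figure 2 caption mentions aligning overlapping windows with a closed-form least-squares solver. The sketch below illustrates one way such an alignment could work; it is not the authors' implementation. It assumes each window is a NumPy array of shape (T, H, W), that the last `overlap` frames of the first window show the same content as the first `overlap` frames of the second, and that the two windows differ by a single global scale and shift. The function names `fit_affine` and `stitch_windows`, and the linear blend over the overlap, are hypothetical choices made for illustration.

```python
import numpy as np


def fit_affine(src, ref):
    """Return (scale, shift) minimizing ||scale * src + shift - ref||^2.

    This is ordinary 1-D linear regression solved in closed form:
    scale = cov(src, ref) / var(src), shift = mean(ref) - scale * mean(src).
    """
    src = np.asarray(src, dtype=np.float64).ravel()
    ref = np.asarray(ref, dtype=np.float64).ravel()
    src_mean, ref_mean = src.mean(), ref.mean()
    var = np.mean((src - src_mean) ** 2)
    if var < 1e-12:
        # Degenerate (constant) source: only a shift can be estimated.
        return 1.0, ref_mean - src_mean
    scale = np.mean((src - src_mean) * (ref - ref_mean)) / var
    shift = ref_mean - scale * src_mean
    return scale, shift


def stitch_windows(win_a, win_b, overlap):
    """Align win_b to win_a on their shared frames, then concatenate.

    Assumption (not from the paper): the overlap is cross-faded linearly.
    """
    scale, shift = fit_affine(win_b[:overlap], win_a[-overlap:])
    aligned_b = scale * win_b + shift
    # Blend weights go from 0 (all win_a) to 1 (all aligned win_b).
    w = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1)
    blended = (1.0 - w) * win_a[-overlap:] + w * aligned_b[:overlap]
    return np.concatenate([win_a[:-overlap], blended, aligned_b[overlap:]], axis=0)


# Usage: two windows cut from one synthetic depth sequence, where the
# second window lives in a different affine frame.
rng = np.random.default_rng(0)
base = rng.uniform(1.0, 5.0, size=(10, 4, 4))
win_a = base[:6]
win_b = (base[4:] - 0.5) / 2.0  # undone by scale=2.0, shift=0.5
stitched = stitch_windows(win_a, win_b, overlap=2)
```

Because the fit has only two unknowns, it stays well conditioned even with short overlaps, provided the overlapping depth values are not constant.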