FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

Yunpeng Bai, Qixing Huang

TL;DR

FiffDepth tackles monocular depth estimation under limited real labeled data by transforming a pretrained diffusion model into a deterministic, feed-forward depth predictor. By preserving the diffusion trajectory and adding a learnable-filter distillation step that uses DINOv2 pseudo-labels, it combines the fine detail of generative models with the robust generalization of pretrained discriminative networks. Training supervises the output at $t=0$ with synthetic data for detail and the output at $t=-1$ with pseudo-labels on real data, using latent-space MAE, gradient-matching, and trajectory losses. Empirically, it achieves strong zero-shot generalization, fine-grained depth detail, and competitive efficiency relative to diffusion-based methods across diverse real-world scenes.
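As a rough illustration of two of the loss terms named above, the sketch below computes an MAE term and a gradient-matching term on small NumPy arrays that stand in for latent depth maps. The function names, the finite-difference form of the gradient term, and the `grad_weight` value are assumptions made for illustration and are not taken from the paper; the trajectory loss is omitted because its form is not described here.

```python
import numpy as np

def mae_loss(pred, target):
    """Mean absolute error between two arrays of equal shape."""
    return float(np.mean(np.abs(pred - target)))

def gradient_matching_loss(pred, target):
    """Mean absolute finite-difference gradient of the residual (x plus y)."""
    diff = pred - target
    grad_x = np.abs(diff[:, 1:] - diff[:, :-1])  # horizontal differences
    grad_y = np.abs(diff[1:, :] - diff[:-1, :])  # vertical differences
    return float(grad_x.mean() + grad_y.mean())

def combined_loss(pred, target, grad_weight=0.5):
    """Weighted sum of both terms; grad_weight is a hypothetical setting."""
    return mae_loss(pred, target) + grad_weight * gradient_matching_loss(pred, target)

# Toy 2x2 "latent depth" target and two predictions.
target = np.array([[0.0, 1.0], [2.0, 3.0]])
shifted = target + 0.5    # constant offset: wrong values, correct edges
sharpened = target * 2.0  # wrong values and wrong edge strength

print(mae_loss(shifted, target), gradient_matching_loss(shifted, target))
print(combined_loss(sharpened, target))
```

A constant offset is penalized only by the MAE term, while the gradient term reacts to mismatched edges, which is why the two are usually paired when fine detail matters.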

Abstract

Monocular Depth Estimation (MDE) is a fundamental 3D vision problem with numerous applications such as 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust and generalizable MDE remains challenging due to limited real-world labeled data and distribution gaps between synthetic datasets and real data. Existing methods often struggle on real-world test data, exhibiting low efficiency, reduced accuracy, and a lack of detail. To address these issues, we propose an efficient MDE approach named FiffDepth. The key feature of FiffDepth is its use of diffusion priors: it transforms diffusion-based image generators into a feed-forward architecture for detailed depth estimation. FiffDepth preserves key generative features and integrates the strong generalization capabilities of models like DINOv2. Through benchmark evaluations, we demonstrate that FiffDepth achieves exceptional accuracy, stability, and fine-grained detail, offering significant improvements over state-of-the-art MDE approaches. The paper's source code is available here: https://yunpeng1998.github.io/FiffDepth/

Paper Structure

This paper contains 13 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Compared to other methods, our model achieves more accurate details and better generalization in depth estimation. The final row shows the point cloud generated from the estimated depth results; the corresponding depth maps are shown in Figure \ref{fig:com2}.
  • Figure 2: Overview of the proposed method. To simplify the representation, every image shown stands for its corresponding latent. We transform the pre-trained diffusion model into a feed-forward model for depth prediction, using only the result at $t=0$ as the output during inference. During training, at $t=0$, we use synthetic data to ensure detailed results, while at $t=-1$, we leverage pseudo-labels generated by DINOv2 for supervision.
  • Figure 3: Filter learning. We use a learnable filter to map our results to detail levels similar to DINOv2’s, matching its outputs and thereby transferring DINOv2’s generalization capabilities to our model without compromising our inherent details.
  • Figure 4: Qualitative comparison across different datasets. Our method is capable of predicting the depth of various fine objects, such as lampposts, railings, and chair legs.
  • Figure 5: Qualitative comparison on special scenarios. In the special scenarios of games, artworks, AI-generated content, and movies, our method demonstrates strong generalization capability and the ability to predict detailed depth.
  • ...and 2 more figures
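
The filter learning described in the Figure 3 caption can be pictured with the sketch below: a small kernel maps a detailed prediction down to the detail level of a coarser pseudo-label before the two are compared, so the unfiltered prediction is not forced to lose its detail. The kernel shape, the edge padding, the MAE comparison, and all names are assumptions for illustration; in the paper the filter is learned, whereas here two fixed kernels are compared and no gradient step is shown.

```python
import numpy as np

def apply_filter(depth, kernel):
    """Correlate a 2D map with a small odd-sized square kernel, edge-padded."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(depth, pad, mode="edge")
    out = np.zeros_like(depth, dtype=float)
    rows, cols = depth.shape
    for i in range(k):
        for j in range(k):
            # Accumulate the shifted copy weighted by the kernel entry.
            out += kernel[i, j] * padded[i:i + rows, j:j + cols]
    return out

def filter_distillation_loss(pred, pseudo_label, kernel):
    """MAE between the filtered prediction and the pseudo-label."""
    return float(np.mean(np.abs(apply_filter(pred, kernel) - pseudo_label)))

# Two candidate kernels: a pass-through and a 3x3 box blur.
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
box = np.full((3, 3), 1.0 / 9.0)

# A detailed prediction (single sharp spike) and a coarser pseudo-label.
detailed = np.zeros((5, 5))
detailed[2, 2] = 9.0
coarse = apply_filter(detailed, box)

print(filter_distillation_loss(detailed, coarse, identity))
print(filter_distillation_loss(detailed, coarse, box))
```

With the pass-through kernel the sharp prediction is penalized for detail the pseudo-label lacks; with the blur kernel the loss vanishes while the prediction itself keeps its spike.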