PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage

Denis Zavadski, Damjan Kalšan, Carsten Rother

TL;DR

PrimeDepth is a monocular depth estimation method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches. It reduces the gap to the currently leading data-driven approach, Depth Anything, which is still quantitatively superior but predicts less detailed depth maps and requires 20 times more labelled data.

Abstract

This work addresses the task of zero-shot monocular depth estimation. A recent advance in this field has been the idea of utilising Text-to-Image foundation models, such as Stable Diffusion. Foundation models provide a rich and generic image representation, and therefore little training data is required to reformulate them as a depth estimation model that predicts highly detailed depth maps and has good generalisation capabilities. However, the realisation of this idea has so far led to approaches which are, unfortunately, highly inefficient at test time due to the underlying iterative denoising process. In this work, we propose a different realisation of this idea and present PrimeDepth, a method that is highly efficient at test time while keeping, or even enhancing, the positive aspects of diffusion-based approaches. Our key idea is to extract from Stable Diffusion a rich, but frozen, image representation by running a single denoising step. This representation, which we term the preimage, is then fed into a refiner network with an architectural inductive bias, before entering the downstream task. We validate experimentally that PrimeDepth is two orders of magnitude faster than the leading diffusion-based method, Marigold, while being more robust for challenging scenarios and quantitatively marginally superior. Thereby, we reduce the gap to the currently leading data-driven approach, Depth Anything, which is still quantitatively superior, but predicts less detailed depth maps and requires 20 times more labelled data. Due to the complementary nature of our approach, even a simple averaging of PrimeDepth and Depth Anything predictions can improve upon both methods and set a new state-of-the-art in zero-shot monocular depth estimation. In future, data-driven approaches may also benefit from integrating our preimage.
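The averaging ensemble mentioned at the end of the abstract can be illustrated with a small sketch. The abstract does not say how the two predictions are brought onto a common scale, so the version below makes an assumption: both inputs are relative depth maps, each is min-max normalised to [0, 1], and the result is the pixel-wise mean. The function names `normalise` and `average_predictions` are hypothetical, and depth maps are flattened to plain Python lists to keep the example dependency-free.

```python
def normalise(depth):
    """Min-max normalise a flattened depth map to [0, 1].

    Assumption: the abstract does not specify this step; it is one
    simple way to make two relative depth maps comparable.
    """
    lo, hi = min(depth), max(depth)
    if hi == lo:
        # A constant map carries no relative depth information.
        return [0.0 for _ in depth]
    return [(d - lo) / (hi - lo) for d in depth]


def average_predictions(pred_a, pred_b):
    """Pixel-wise mean of two normalised depth predictions."""
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same number of pixels")
    a, b = normalise(pred_a), normalise(pred_b)
    return [(x + y) / 2 for x, y in zip(a, b)]


if __name__ == "__main__":
    # Two toy three-pixel "depth maps" on different scales.
    print(average_predictions([0.0, 5.0, 10.0], [2.0, 4.0, 6.0]))
```

An evaluation pipeline may instead align each prediction to the ground truth before averaging; the sketch only shows the averaging itself.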

Paper Structure (24 sections, 4 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Results of Depth Anything [Yang2024_DepthAnything] (top row), Marigold [Marigold_Ke_2024_CVPR] (middle row) and our PrimeDepth (bottom row) for a challenging scene from the ETH3D dataset [schops2017ETH3D]. Our non-optimised method (0.57 sec) and Depth Anything (0.13 sec) are both fast at test time, whereas Marigold is rather slow (62.84 sec). Runtimes were measured on an A100 GPU. Visually, our result shows the most detail: it is more detailed than Depth Anything and less grainy than Marigold. This is, however, not reflected in the quantitative numbers for this image (Depth Anything $\delta_1 = 99.95\%$, Marigold $\delta_1 = 99.88\%$ and Ours $\delta_1 = 99.18\%$). The reason is that the ground-truth LIDAR data is sparse and has holes at objects with fine details (see bottom left). While the data-driven method, Depth Anything, requires a large corpus of training data (1.5M labelled and 62M unlabelled images), ours and Marigold need only 74K synthetic training images.
  • Figure 2: (Left) Stable Diffusion preimage, consisting of intermediate feature maps and cross- and self-attention maps for every neural block of the last denoising step. (Right) Examples of self-attention maps, with respect to the red square, and of cross-attention maps. Below, the fusion model for aggregating the preimage parts is shown.
  • Figure 3: PrimeDepth Pipeline. The input image is first encoded to the latent domain, augmented with one noise step and processed by the frozen U-Net of Stable Diffusion. The intermediate parts of the preimage (red arrows) are aggregated with the fusion module (see Figure 2) and provided to the preimage refiner network at the respective intermediate stages. The output of the refiner is fed to two downstream heads for the respective downstream tasks.
  • Figure 4: Qualitative results of two competing methods (Depth Anything [Yang2024_DepthAnything] and Marigold [Marigold_Ke_2024_CVPR]) for 4 datasets; results for ETH3D are shown in Figure 1. The main visual artefacts of the respective methods are indicated by arrows. The prominent observations across many images are as follows. Depth Anything produces less sharp depth maps (KITTI, nuScenes-C, Figure 1) and can see inside mirrors (NYUv2) and through transparent surfaces (supplement). Marigold predicts sharp depth maps, but sometimes with grainy artefacts (KITTI, Figure 1). It also struggles to predict sky (KITTI, nuScenes-C) and objects at mid-distance (rabbitai). Our method gives sharper depth maps than Depth Anything (KITTI, nuScenes-C), but can also struggle with sky (supplement).
  • Figure 5: Box plot of the $\delta_1$ accuracy for challenging scenes from the nuScenes-C dataset, split into 6 categories, where the long vertical stripes mark the median values. Our PrimeDepth is consistently, but marginally, inferior to Depth Anything [Yang2024_DepthAnything], and consistently, sometimes considerably, superior to Marigold [Marigold_Ke_2024_CVPR]. The variability, measured by the IQR score (i.e. the size of a box), is considerably higher for Marigold than for Depth Anything and our method. For nighttime scenes, the performance drops for all methods; however, Marigold is clearly more affected (lowest median and highest variability).
  • ...and 9 more figures
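
The $\delta_1$ accuracy quoted in Figure 1 and the median/IQR summary used in Figure 5 can be sketched as follows. $\delta_1$ is the fraction of valid pixels whose ratio $\max(\hat{d}/d, d/\hat{d})$ is below 1.25. The sketch rests on several assumptions not stated in the captions: the prediction is already aligned to the ground-truth scale, holes in the sparse LIDAR ground truth are encoded as values $\le 0$, and depth maps are flattened to plain Python lists. The helper names `delta1` and `median_and_iqr` are hypothetical.

```python
from statistics import median, quantiles


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.

    Pixels with gt <= 0 are treated as holes in the sparse ground truth
    and are excluded from the score (an assumed encoding).
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    # A non-positive prediction at a valid pixel counts as a miss.
    hits = sum(1 for p, g in valid if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(valid)


def median_and_iqr(scores):
    """Median and interquartile range (Q3 - Q1) of per-image scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q3 - q1


if __name__ == "__main__":
    # The third pixel has no ground truth and is ignored.
    print(delta1([1.0, 2.0, 3.0, 10.0], [1.1, 2.0, 0.0, 5.0]))
    print(median_and_iqr([0.90, 0.92, 0.95, 0.97, 0.99]))
```

Because holes are excluded from the score, fine structures that lack LIDAR returns do not contribute to $\delta_1$. This is consistent with the Figure 1 remark that the visual difference in detail is not reflected in the numbers.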