D4D: An RGBD diffusion model to boost monocular depth estimation

L. Papa, P. Russo, I. Amerini

TL;DR

The paper tackles the scarcity of labeled RGBD data for monocular depth estimation by introducing Diffusion4D (D4D), a customized 4-channel diffusion model that generates realistic RGBD samples. It designs three diffusion configurations (S1, S2, S3) and integrates offline data generation into a three-stage training pipeline to augment four MDE backbones (DenseDepth, FastDepth, SPEED, METER). Empirical results on NYU Depth v2 and KITTI show consistent RMSE improvements over both synthetic and original data baselines, with notable gains in indoor, outdoor, and cross-domain scenarios, including DIML/CVL RGB-D. The authors also release D4D-NYU and D4D-KITTI datasets and demonstrate the method's applicability to efficient ViT variants, highlighting a practical path to mitigating data scarcity in depth-related tasks. Overall, D4D provides a general, diffusion-based data augmentation strategy that improves depth estimation accuracy and generalization in real-world settings.

Abstract

Ground-truth RGBD data are fundamental for a wide range of computer vision applications; however, such labeled samples are difficult to collect and time-consuming to produce. A common way to overcome this lack of data is to use graphics engines to produce synthetic proxies; however, those data often do not reflect real-world images, resulting in poor performance of the trained models at inference time. In this paper we propose a novel training pipeline that incorporates Diffusion4D (D4D), a customized 4-channel diffusion model able to generate realistic RGBD samples. We show the effectiveness of the developed solution in improving the performance of deep learning models on the monocular depth estimation task, where the correspondence between RGB and depth map is crucial to achieving accurate measurements. Our supervised training pipeline, enriched by the generated samples, outperforms training on synthetic and on original data, achieving RMSE reductions of (8.2%, 11.9%) and (8.1%, 6.1%) on the indoor NYU Depth v2 and the outdoor KITTI datasets, respectively.
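
The abstract reports its gains as percentage reductions in RMSE. As a minimal sketch of how such a figure is typically derived, the snippet below computes the RMSE between a predicted and a ground-truth depth map and the relative reduction between two RMSE values. The function names `rmse` and `relative_reduction` are hypothetical, and the paper's exact evaluation protocol (for example, masking of invalid pixels or depth capping) is not described here, so this assumes every pixel is valid.

```python
import numpy as np


def rmse(pred, target):
    """Root mean squared error over all pixels (assumes all pixels are valid)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def relative_reduction(baseline_rmse, new_rmse):
    """Percentage by which new_rmse is lower than baseline_rmse."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    baseline = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
    improved = rmse([1.0, 2.0, 4.0], [1.0, 2.0, 5.0])
    print(f"relative RMSE reduction: {relative_reduction(baseline, improved):.1f}%")
```

A positive value means the second model has a lower error than the baseline; a negative value would indicate a regression.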

Paper Structure

This paper contains 6 sections, 10 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: D4D-generated RGBD samples based on the indoor NYU Depth v2 (right) and the outdoor KITTI (left) datasets. The images are scaled to match the aspect ratio of the original samples. The depth maps are converted to RGB format with a perceptually uniform colormap for easier viewing, while the two bottom colorbars show the depth data distribution (in meters) over the generated samples.
  • Figure 2: Graphical representation of the introduced training pipeline. Stage 1 shows the pre-processing operations applied to 4-channel samples extracted from the original training dataset. Stage 2 covers the training and unconditioned generation processes of the D4D model. Stage 3 depicts the training procedure of a generic encoder-decoder MDE network, highlighting how the RGBD training samples are composed.
  • Figure 3: Indoor results. Qualitative analysis of the predictions obtained with the DenseDepth method. The model has been tested on the NYU (indoor) dataset. S0 is the baseline setup, i.e., the MDE model is trained only on the NYU dataset. In the Synthetic setup, DenseDepth has been trained on NYU and a $50K$ subset of the SceneNet dataset. In $S_i$ with $i \in \{1, 2, 3\}$, as described in Section 3, DenseDepth has been trained on NYU and $50K$ samples taken from our proposed D4D-NYU datasets generated at a resolution of $320\times240$. The Difference Map is computed as the per-pixel difference between the predicted ($\hat{y}$) and expected ($y$) depth, while the reported colorbars show the depth/error range in centimeters ($cm$).
  • Figure 4: Outdoor results. Qualitative analysis of the predictions obtained with the DenseDepth method. The model has been tested on the KITTI (outdoor) dataset. S0 is the baseline setup, i.e., DenseDepth is trained only on the KITTI dataset. In the Synthetic setup, the model has been trained on the KITTI and SYNTHIA-SF datasets. In the proposed configuration (S3), the model has been trained on KITTI and $50K$ samples taken from our proposed D4D-KITTI datasets generated at a resolution of $320\times240$. The Difference Map is computed as the per-pixel difference between the predicted ($\hat{y}$) and expected ($y$) depth, while the reported colorbars show the depth/error range in decimeters ($dm$).
  • Figure 5: Generalization. Qualitative analysis of the predictions obtained with the DenseDepth method. The model has been tested in blind conditions (i.e., without fine-tuning) on the DIML/CVL RGB-D dataset after being trained on a different indoor dataset, i.e., NYU for S0, SceneNet for Synthetic, and D4D-NYU for S1, S2, and S3. The Difference Map is computed as the per-pixel difference between the predicted ($\hat{y}$) and expected ($y$) depth, while the reported colorbars show the depth/error range in centimeters ($cm$).
  • ...and 1 more figure
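
The captions above mention two simple array operations: composing 4-channel RGBD samples (Figure 2, Stage 1) and computing a per-pixel Difference Map between the predicted and expected depth (Figures 3 to 5). The sketch below illustrates both under stated assumptions. The function names `to_rgbd` and `difference_map_cm`, the scaling of each channel to [-1, 1], and the 10 m depth cap are all hypothetical choices made for illustration. The paper's actual pre-processing steps are not listed in this summary.

```python
import numpy as np


def to_rgbd(rgb_uint8, depth_m, max_depth_m=10.0):
    """Stack an RGB image and a depth map into one H x W x 4 array.

    Assumption: every channel is scaled to [-1, 1], a common convention for
    diffusion models; max_depth_m is a hypothetical depth cap in meters.
    """
    rgb = rgb_uint8.astype(np.float32) / 127.5 - 1.0
    depth = np.clip(depth_m.astype(np.float32), 0.0, max_depth_m)
    depth = depth / max_depth_m * 2.0 - 1.0
    # Append depth as the fourth channel.
    return np.concatenate([rgb, depth[..., None]], axis=-1)


def difference_map_cm(pred_m, target_m):
    """Signed per-pixel difference (prediction minus ground truth), in cm."""
    diff_m = pred_m.astype(np.float32) - target_m.astype(np.float32)
    return diff_m * 100.0


if __name__ == "__main__":
    # Synthetic inputs at the 320x240 resolution mentioned in the captions.
    rgb = np.zeros((240, 320, 3), dtype=np.uint8)
    depth = np.full((240, 320), 5.0, dtype=np.float32)
    print(to_rgbd(rgb, depth).shape)
    print(difference_map_cm(depth + 0.5, depth).mean())
```

The signed difference keeps the direction of the error (over- or under-estimation); taking its absolute value would give an error-magnitude map instead.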