Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation?

Dmitry Ignatov, Andrey Ignatov, Radu Timofte

TL;DR

The authors address limitations in indoor monocular depth estimation data by introducing ANYU, a virtually augmented NYU Depth V2 dataset created by randomly embedding VR objects into real RGB-D frames using Unity. ANYU is released in 10% and 100% augmentation configurations and evaluated with both diffusion-based VPD and transformer-based PixelFormer models, demonstrating improved depth estimation and generalization on NYU-v2 and iBims-1. The key finding is that randomized virtual augmentation can meaningfully enhance model robustness across architectures, with notable gains at modest augmentation levels and continued cross-dataset benefits, culminating in state-of-the-art results for at least one model. The work provides practical guidance for virtual augmentation strategies and publicly releases the ANYU dataset for broader adoption in indoor monocular depth estimation.

Abstract

We present ANYU, a new virtually augmented version of the NYU Depth V2 dataset designed for monocular depth estimation. In contrast to the well-known approach in which full 3D scenes of a virtual world are used to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects into the original NYU Depth V2 images. We deliberately did not match each generated virtual object with an appropriate texture or a suitable location within the real-world image. Instead, the assignment of texture, location, lighting, and other rendering parameters was randomized to maximize the diversity of the training data and to show that randomness itself can improve the generalization ability of a dataset. By conducting extensive experiments with our virtually modified dataset and validating on the original NYU Depth V2 and iBims-1 benchmarks, we show that ANYU improves the monocular depth estimation performance and generalization of deep neural networks with considerably different architectures, especially for the current state-of-the-art VPD model. To the best of our knowledge, this is the first work that augments a real-world dataset with randomly generated virtual 3D objects for monocular depth estimation. We make our ANYU dataset publicly available in two training configurations, with 10% and 100% additional synthetically enriched RGB-D pairs of training images, respectively, for efficient training and empirical exploration of virtual augmentation, at https://github.com/ABrain-One/ANYU.
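
As a rough illustration of what "incorporating RGB-D representations of virtual objects" into a real frame can involve, the sketch below pastes a rendered object patch into a real RGB-D image at a random location and keeps an object pixel only where it is closer to the camera than the real surface. This is a minimal sketch under stated assumptions, not the authors' Unity pipeline: the function names `composite_rgbd` and `random_placement`, the mask input, and the depth-test rule are all hypothetical choices made for the example.

```python
import numpy as np

def composite_rgbd(real_rgb, real_depth, obj_rgb, obj_depth, obj_mask, top_left):
    """Paste a virtual object's RGB-D patch into a real RGB-D frame.

    Assumption (not from the paper): an object pixel replaces the real pixel
    only where the object is present (obj_mask) and lies closer to the camera
    than the real surface at that pixel.
    """
    out_rgb = real_rgb.copy()
    out_depth = real_depth.copy()
    y0, x0 = top_left
    h, w = obj_depth.shape
    # Slices are views, so writes below modify out_rgb / out_depth in place.
    rgb_win = out_rgb[y0:y0 + h, x0:x0 + w]
    depth_win = out_depth[y0:y0 + h, x0:x0 + w]
    visible = obj_mask & (obj_depth < depth_win)
    rgb_win[visible] = obj_rgb[visible]
    depth_win[visible] = obj_depth[visible]
    return out_rgb, out_depth

def random_placement(rng, frame_shape, patch_shape):
    """Pick a random top-left corner so that the patch fits inside the frame."""
    H, W = frame_shape
    h, w = patch_shape
    y0 = int(rng.integers(0, H - h + 1))
    x0 = int(rng.integers(0, W - w + 1))
    return y0, x0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_rgb = np.zeros((480, 640, 3), dtype=np.uint8)
    real_depth = np.full((480, 640), 3.0, dtype=np.float32)   # metres
    obj_rgb = np.full((64, 64, 3), 200, dtype=np.uint8)
    obj_depth = np.full((64, 64), 1.5, dtype=np.float32)
    obj_mask = np.ones((64, 64), dtype=bool)
    corner = random_placement(rng, real_depth.shape, obj_depth.shape)
    aug_rgb, aug_depth = composite_rgbd(
        real_rgb, real_depth, obj_rgb, obj_depth, obj_mask, corner
    )
    print(corner, float(aug_depth.min()))
```

The sketch randomizes only the location; texture, lighting, and the other rendering parameters mentioned in the abstract would be randomized in the renderer before the patch is produced.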

Paper Structure (14 sections, 24 figures, 3 tables)

Figures (24)

  • Figure 1: Examples of virtually augmented NYU-v2 RGB-D training pairs. Columns 1 and 3 show augmented RGB images; columns 2 and 4 show the corresponding depth maps.
  • Figure 2: Performance breakdown of the VPD model (Zhao et al., 2023) trained on the NYU-v2 dataset expanded by up to a factor of 2 (100%) with virtualized RGB-D training images. All commonly used error metrics (RMSE↓, REL↓, $\log_{10}$↓) and accuracy metrics ($\delta_{1}$↑, $\delta_{2}$↑, $\delta_{3}$↑) for depth estimation improve over the results obtained with the original NYU-v2 dataset (0% augmented images on the horizontal axis).
  • Figure 3: Performance of the VPD model (Zhao et al., 2023) tested on the virtually modified NYU-v2 test set after training on the NYU-v2 dataset expanded by up to a factor of 2 with virtualized RGB-D images.
  • Figure 4: Sample visual results obtained with the VPD model (Zhao et al., 2023) using the proposed augmented NYU-v2 dataset (ANYU).
  • Figure 5: Sample visual results and the corresponding crops obtained with the VPD model (Zhao et al., 2023) trained on the original and augmented NYU-v2 datasets. Objects and their contours appear clearer and better defined when data augmentation is used.
  • ...and 19 more figures
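
The error and accuracy metrics named in the Figure 2 caption can be sketched as follows, using the definitions commonly found in the monocular depth estimation literature (threshold 1.25 raised to the powers 1, 2, and 3 for the δ metrics). This is an illustrative sketch only: the function name `depth_metrics` and the validity rule `gt > 0` are assumptions, and the paper's exact evaluation protocol (cropping, depth caps) is not reproduced here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Compute common depth-estimation metrics over valid ground-truth pixels.

    Assumptions: depths are in the same units, predictions are strictly
    positive, and ground-truth pixels with value 0 are treated as missing.
    """
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    # Symmetric ratio used by the threshold accuracies delta_1..delta_3.
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rel": float(np.mean(np.abs(pred - gt) / gt)),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }

if __name__ == "__main__":
    gt = np.array([[1.0, 2.0], [0.0, 4.0]])    # 0.0 marks a missing pixel
    pred = np.array([[1.1, 2.0], [3.0, 5.0]])
    for name, value in depth_metrics(pred, gt).items():
        print(f"{name}: {value:.4f}")
```

Lower is better for RMSE, REL, and log10; higher is better for the δ accuracies, matching the arrows in the caption.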