DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, Aniket Bera

TL;DR

DL3DV-10K introduces a large-scale real-world multi-view scene dataset for deep learning-based 3D vision, addressing the scarcity of diverse benchmarks. It comprises 51.2 million frames from 10,510 4K videos spanning 65 point-of-interest (POI) types, and introduces DL3DV-140 as a comprehensive NVS benchmark to evaluate state-of-the-art methods under challenging lighting, reflection, and texture conditions. The paper reports that pretraining NeRF-style models on DL3DV-10K improves generalization to unseen real-world scenes, supporting the vision of learning universal 3D priors from large-scale data. The dataset, benchmark, and preliminary generalizable NeRF results establish a foundation toward scalable, real-world 3D representation learning.

Abstract

We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision, limited to either synthetic environments or a narrow selection of real-world scenes, are quite insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what could be explored in deep learning-based 3D analysis. To address this critical gap, we present DL3DV-10K, a large-scale scene dataset, featuring 51.2 million frames from 10,510 videos captured from 65 types of point-of-interest (POI) locations, covering both bounded and unbounded scenes, with different levels of reflection, transparency, and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K, which revealed valuable insights for future research in NVS. In addition, we have obtained encouraging results in a pilot study to learn generalizable NeRF from DL3DV-10K, which manifests the necessity of a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representation. Our DL3DV-10K dataset, benchmark results, and models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/.

Paper Structure (43 sections, 17 figures, 4 tables)

Figures (17)

  • Figure 1: We introduce DL3DV-10K, a large-scale scene dataset capturing real-world scenarios. DL3DV-10K contains 10,510 videos at 4K resolution spanning 65 types of point-of-interest (POI) locations, covering a wide range of everyday areas. With fine-grained annotations of scene diversity and complexity, DL3DV-10K enables a comprehensive benchmark for novel view synthesis and supports learning-based 3D representation techniques in acquiring a universal prior at scale.
  • Figure 2: The efficient data acquisition pipeline of DL3DV-10K. Refer to supplementary materials for more visual illustrations of scene coverage.
  • Figure 3: We show the distribution of scene categories (the primary POI locations) by complexity indices, including environmental setting, light conditions, reflective surfaces, and transparent materials. Light-condition attributes are natural light ('nlight'), artificial light ('alight'), and a combination of both ('mlight'). The reflection class takes the values 'more', 'medium', 'less', and 'none'; the transparency class uses the same values.
  • Figure 4: Panel A presents the density plots of PSNR and SSIM, and their relationship, on DL3DV-140 for each method. Panel B compares performance by scene complexity. The text above each bar plot gives the mean value across methods for that attribute.
  • Figure 5: We compare SOTA NVS methods against the corresponding ground-truth images on held-out test views from DL3DV-140. More examples can be found in the supplementary materials. The scenes are classified by complexity indices: indoor vs. outdoor, more-ref vs. less-ref, high-freq vs. low-freq, and more-transp vs. less-transp. Best viewed zoomed in.
  • ...and 12 more figures