SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth

John McCormac, Ankur Handa, Stefan Leutenegger, Andrew J. Davison

TL;DR

SceneNet RGB-D delivers a scalable, photorealistic synthetic indoor RGB-D dataset with exhaustive ground truth for semantic/instance segmentation, depth, optical flow, camera pose, and 3D reconstruction. It achieves this through automatic physics-based scene generation, randomized textures and lighting, and a GPU-accelerated photon-mapped renderer, resulting in 5 million frames across thousands of layouts. The work demonstrates the practicality of large-scale synthetic data for pretraining and SLAM-style tasks while candidly noting limitations such as static scenes and labeling gaps. This dataset provides a foundation for robust domain adaptation and temporal scene understanding in robotics and AR applications, with potential extensions to dynamic scenes and on-the-fly data generation.

Abstract

We introduce SceneNet RGB-D, expanding the previous work of SceneNet to enable large-scale photorealistic rendering of indoor scene trajectories. It provides pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection, and also for geometric computer vision problems such as optical flow, depth estimation, camera pose estimation, and 3D reconstruction. Random sampling permits virtually unlimited scene configurations, and here we provide a set of 5M rendered RGB-D images from over 15K trajectories in synthetic layouts with random but physically simulated object poses. Each layout also has random lighting, camera trajectories, and textures. The scale of this dataset is well suited for pre-training data-driven computer vision techniques from scratch with RGB-D inputs, which has previously been limited by the relatively small labelled datasets NYUv2 and SUN RGB-D. It also provides a basis for investigating 3D scene labelling tasks by providing perfect camera poses and depth data as a proxy for a SLAM system. We host the dataset at http://robotvault.bitbucket.io/scenenet-rgbd.html

Paper Structure

This paper contains 16 sections, 5 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Flow chart of the different stages in our pipeline. Physically realistic scenes are created using Chrono Engine by dropping objects from the ceiling. These scenes are used to generate automated camera trajectories simulating human hand-held motion, and both are passed on to the rendering engine (inspired by OptiX) to produce RGB-D ground truth.
  • Figure 2: Hand-picked examples from our dataset. Rendered images on the left and the available ground truth information on the right.
  • Figure 3: On the left is the original photo; on the right are unique, randomly coloured voxels that remain the same throughout a trajectory. Outside the window there is no depth reading, so we assign all of these areas the same default identifier.
  • Figure 4: Probability distributions of heights (in m) of different objects as obtained from SUN RGB-D. It is interesting to see that some objects like cabinets and lamps clearly do have multimodal height distributions.
  • Figure 5: Top 50 objects and their log proportions by scene type. The unfortunate number of mailboxes is the result of a mistaken mapping of the 'box' class in SUN RGB-D to a class that is defined as box in ShapeNets but contains primarily mailboxes. This mishap serves to highlight some of the difficulties inherent in working with large quantities of objects and labels in an automated way. *Beds are subdivided into a number of similar classes, such as miscbeds and kingsized beds; here we combine these into a coherent group.
  • ...and 7 more figures