SceneNet: Understanding Real World Indoor Scenes With Synthetic Data

Ankur Handa, Viorica Patraucean, Vijay Badrinarayanan, Simon Stent, Roberto Cipolla

TL;DR

The paper tackles data scarcity in indoor scene understanding by introducing SceneNet, a synthetic depth-data pipeline built on annotated 3D scenes (SN-BS) and automatic, constraint-guided scene generation. The authors render depth frames with Kinect-like noise and train a segmentation network on DHA (depth, height, angle with gravity) inputs. Compared with NYUv2-only training, this improves results on NYUv2 and SUN RGB-D and approaches state-of-the-art per-pixel labeling without any RGB information. Key contributions include a scalable method for generating virtually unlimited labeled depth data, a validated improvement when fine-tuning on real data, and a depth-only segmentation benchmark on SUN RGB-D. The work suggests that synthetic data can substantially advance real-world indoor scene understanding, and it opens the way to future work on video and sequential modeling.

Abstract

Scene understanding is a prerequisite to many high-level tasks for any automated intelligent machine operating in real-world environments. Recent attempts with supervised learning have shown promise in this direction but have also highlighted the need for an enormous quantity of supervised data: performance increases in proportion to the amount of data used. However, this quickly becomes prohibitive when considering the manual labour needed to collect such data. In this work, we focus our attention on depth-based semantic per-pixel labelling as a scene understanding problem and show the potential of computer graphics to generate virtually unlimited labelled data from synthetic 3D scenes. By carefully synthesizing training data with appropriate noise models, we show comparable performance to state-of-the-art RGBD systems on the NYUv2 dataset despite using only depth data as input, and we set a benchmark for depth-based segmentation on the SUN RGB-D dataset. Additionally, we offer a route to generating synthesized frame or video data, and an understanding of the different factors influencing performance gains.
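
To make the idea of "appropriate noise models" concrete, the following is a minimal sketch of corrupting a clean rendered depth map so that it looks more like sensor output. It is not the noise model used in the paper: the function name `add_depth_noise`, the quadratic growth of the noise with depth, the coefficients, and the random dropout rate are all illustrative assumptions.

```python
import random


def add_depth_noise(depth, sigma_base=0.001, sigma_quad=0.002, dropout=0.05, seed=0):
    """Return a noisy copy of a depth map given as a list of rows (metres).

    Assumptions (illustrative, not from the paper):
      * Gaussian noise whose standard deviation grows quadratically with depth.
      * A fixed probability of dropping a pixel, encoded as 0.0 (missing depth).
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    noisy = []
    for row in depth:
        out = []
        for z in row:
            # Keep already-missing pixels missing, and randomly drop others.
            if z <= 0.0 or rng.random() < dropout:
                out.append(0.0)
                continue
            sigma = sigma_base + sigma_quad * z * z
            # Clamp at zero so a large negative draw cannot yield negative depth.
            out.append(max(0.0, z + rng.gauss(0.0, sigma)))
        noisy.append(out)
    return noisy


# Hypothetical 2x2 clean depth map; the 0.0 entry is an invalid pixel.
clean = [[1.0, 2.0], [0.0, 3.0]]
noisy = add_depth_noise(clean)
```

Setting both sigma parameters and `dropout` to zero returns the input unchanged, which is a convenient check when comparing training with and without noise.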

Paper Structure

This paper contains 13 sections, 3 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Missing labels (a) and mislabelled frames (b) are very common in many real datasets. In (b) the toilet and sink have the same ground truth label. Both images are from SUN RGB-D (Song et al., CVPR 2015).
  • Figure 2: Annotated 3D models allow the generation of per-pixel semantically labelled images from arbitrary viewpoints, such as from a floor-based robot or a UAV. Just as the ImageNet (Deng et al., CVPR 2009) and ModelNet (Wu et al., CVPR 2015) datasets have fostered recent advances in image classification (Krizhevsky et al., NIPS 2012) and 3D shape recognition (Su et al., ICCV 2015), we propose SceneNet as a valuable dataset towards the goal of indoor scene understanding.
  • Figure 3: Snapshots of detailed scenes for each category in SceneNet, hosted at robotvault.bitbucket.org
  • Figure 4: Co-occurrence statistics for bedroom scenes in NYUv2 40 class labels. Warmer colours reflect higher co-occurrence frequency.
  • Figure 5: Effect of different constraints on the optimisation. With no pairwise or visibility constraints, objects appear scattered at random (a). When pairwise constraints are added, the sofa, table and TV assume sensible relative positions, but the chair and vacuum cleaner occlude the view (b). With all constraints, occlusions are removed.
  • ...and 5 more figures
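
The co-occurrence statistics described in the Figure 4 caption can be illustrated with a short sketch. It counts, for each unordered pair of object labels, the fraction of scenes that contain both. The function names and the toy bedroom annotations below are hypothetical and are not taken from NYUv2 or from the paper's code.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(scenes):
    """Count how many scenes contain each unordered pair of object labels."""
    counts = Counter()
    for labels in scenes:
        # De-duplicate and sort so each pair is counted once per scene,
        # always under the same (alphabetical) key.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_frequency(scenes):
    """Normalise the pair counts by the number of scenes."""
    n = len(scenes)
    return {pair: c / n for pair, c in cooccurrence_counts(scenes).items()}


# Hypothetical toy bedroom annotations (illustrative only).
toy_bedrooms = [
    ["bed", "pillow", "lamp"],
    ["bed", "pillow", "night stand"],
    ["bed", "lamp", "night stand"],
    ["bed", "pillow"],
]
freq = cooccurrence_frequency(toy_bedrooms)
```

In this toy data, "bed" and "pillow" appear together in three of the four scenes, so that pair would be drawn in a warmer colour than "lamp" and "pillow", which appear together only once.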