3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, Silvio Savarese

TL;DR

The paper addresses the challenge of reconstructing 3D objects from a limited number of image views, possibly taken from uncalibrated viewpoints, by learning an end-to-end mapping from images to 3D shapes. It introduces 3D-R2N2, a unified encoder–3D-LSTM–decoder architecture that incrementally refines a voxel-based reconstruction as more views become available and is trained on synthetic data without image annotations or object class labels. Key contributions include a 3D Convolutional LSTM with local connectivity, a 3D deconvolutional decoder, and demonstrations of single-view and multi-view reconstruction that outperform state-of-the-art single-view methods and remain robust when traditional SFM/SLAM methods fail. The approach generalizes to real-world images and shows competitive or superior performance against MVS under sparse or textureless conditions, which makes it relevant for rapid 3D prototyping and recognition.

Abstract

Inspired by the recent success of methods that employ shape priors to achieve robust 3D reconstructions, we propose a novel recurrent neural network architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2). The network learns a mapping from images of objects to their underlying 3D shapes from a large collection of synthetic data. Our network takes in one or more images of an object instance from arbitrary viewpoints and outputs a reconstruction of the object in the form of a 3D occupancy grid. Unlike most of the previous works, our network does not require any image annotations or object class labels for training or testing. Our extensive experimental analysis shows that our reconstruction framework i) outperforms the state-of-the-art methods for single view reconstruction, and ii) enables the 3D reconstruction of objects in situations when traditional SFM/SLAM methods fail (because of lack of texture and/or wide baseline).
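
The control flow described in the abstract (one or more views in, one occupancy grid out, with the same network handling both cases) can be sketched as a loop over views. This is only an illustrative sketch: `reconstruct_incrementally` and the `toy_*` functions are hypothetical placeholders, not the paper's encoder, 3D-LSTM, or decoder layers, and the flattened 8-voxel grid is an arbitrary assumption.

```python
import math
from typing import Callable, List, Sequence

Grid = List[float]  # flattened voxel grid, one value per voxel (assumption)


def reconstruct_incrementally(
    views: Sequence[Sequence[float]],
    encode: Callable[[Sequence[float]], float],
    update: Callable[[Grid, float], Grid],
    decode: Callable[[Grid], Grid],
    num_voxels: int,
) -> List[Grid]:
    """Return one occupancy-probability grid per view seen so far."""
    hidden: Grid = [0.0] * num_voxels  # recurrent state starts empty
    outputs: List[Grid] = []
    for view in views:
        feature = encode(view)             # image -> low-dimensional feature
        hidden = update(hidden, feature)   # recurrent unit folds in the new view
        outputs.append(decode(hidden))     # hidden state -> occupancy grid
    return outputs


# Hypothetical stand-ins so the sketch runs; they are NOT the paper's layers.
def toy_encode(view: Sequence[float]) -> float:
    return sum(view) / len(view)


def toy_update(hidden: Grid, feature: float) -> Grid:
    return [0.5 * h + 0.5 * feature for h in hidden]


def toy_decode(hidden: Grid) -> Grid:
    return [1.0 / (1.0 + math.exp(-h)) for h in hidden]


single = reconstruct_incrementally(
    [[0.2, 0.4]], toy_encode, toy_update, toy_decode, num_voxels=8
)
multi = reconstruct_incrementally(
    [[0.2, 0.4], [0.6, 0.8], [1.0, 1.0]],
    toy_encode, toy_update, toy_decode, num_voxels=8,
)
```

The single-view and multi-view calls go through the same function; only the length of `views` changes, and `multi` holds one progressively updated grid per view.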

Paper Structure

This paper contains 17 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: (a) Some sample images of the objects we wish to reconstruct. Notice that the views are separated by a large baseline and that the objects show little texture and/or are non-Lambertian. (b) An overview of our proposed 3D-R2N2: The network takes a sequence of images (or just one image) from arbitrary (uncalibrated) viewpoints as input (in this example, 3 views of the armchair) and generates a voxelized 3D reconstruction as output. The reconstruction is incrementally refined as the network sees more views of the object.
  • Figure 2: Network architecture: Each 3D-R2N2 consists of an encoder, a recurrence unit and a decoder. After every convolution layer, we place a LeakyReLU nonlinearity. The encoder converts a $127 \times 127$ RGB image into a low-dimensional feature which is then fed into the 3D-LSTM. The decoder then takes the 3D-LSTM hidden states and transforms them to a final voxel occupancy map. We use two versions of 3D-R2N2: (top) a shallow network and (bottom) a deep residual network [resnet].
  • Figure 3: (a) At each time step, each unit (purple) in the 3D-LSTM receives the same feature vector from the encoder as well as the hidden states from its neighbors (red) by a $3\times 3\times 3$ convolution ($W_s \ast h_{t-1}$) as inputs. We propose two versions of 3D-LSTMs: (b) 3D-LSTMs without output gates and (c) 3D Gated Recurrent Units (GRUs).
  • Figure 4: (a) Reconstruction samples on the PASCAL VOC dataset. (b) Failed reconstructions on the PASCAL VOC dataset. Note that Kar et al. [kar2015category] is trained/tested per category and takes ground-truth object segmentation masks and keypoint labels as additional input.
  • Figure 5: (a), (b): Multi-view reconstruction using our model on the ShapeNet dataset. Performance is reported as the median (red line) and mean (green dot) of the cross-entropy loss and intersection-over-union (IoU) values. The box plot shows the 25th and 75th percentiles, with caps at the 15th and 85th percentiles. (c): Per-category reconstruction of the ShapeNet dataset using our model. The values are average IoU.
  • ...and 3 more figures
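
Figure 5 reports cross-entropy loss and IoU on voxel grids. A minimal sketch of how these two quantities can be computed is given here, assuming the prediction is a flattened grid of per-voxel occupancy probabilities and the ground truth is a flattened binary grid. The function names, the 0.5 threshold, the averaging over voxels, and the empty-grid convention are assumptions of this sketch, not values taken from the text above.

```python
import math
from typing import Sequence


def voxel_iou(pred_probs: Sequence[float], target: Sequence[int],
              threshold: float = 0.5) -> float:
    """Intersection over union of thresholded predictions and ground truth."""
    if len(pred_probs) != len(target):
        raise ValueError("grids must have the same number of voxels")
    intersection = 0
    union = 0
    for p, y in zip(pred_probs, target):
        occupied = p > threshold           # assumed voxelization threshold
        if occupied and y == 1:
            intersection += 1
        if occupied or y == 1:
            union += 1
    # Convention chosen for this sketch: two empty grids match perfectly.
    return intersection / union if union else 1.0


def voxel_cross_entropy(pred_probs: Sequence[float], target: Sequence[int],
                        eps: float = 1e-7) -> float:
    """Voxel-wise binary cross-entropy, averaged over voxels (assumption)."""
    if len(pred_probs) != len(target):
        raise ValueError("grids must have the same number of voxels")
    total = 0.0
    for p, y in zip(pred_probs, target):
        p = min(max(p, eps), 1.0 - eps)    # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(target)


pred = [0.9, 0.8, 0.2, 0.6]   # toy flattened grid of occupancy probabilities
truth = [1, 1, 0, 0]          # toy flattened ground-truth occupancy
iou = voxel_iou(pred, truth)  # 2 voxels in the intersection, 3 in the union
loss = voxel_cross_entropy(pred, truth)
```

Summing instead of averaging the cross-entropy changes the value only by a constant factor equal to the number of voxels.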