Multi-view 3D Models from Single Images with a Convolutional Network

Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox

TL;DR

This work tackles 3D reconstruction from a single image by learning an implicit 3D representation within a feed-forward encoder–decoder CNN that can render unseen views of the object and predict the corresponding depth maps. Views and depth maps predicted from several viewpoints are fused into a 3D point cloud, which is then optimized into a mesh, so reconstruction proceeds without explicit 3D models. Trained on synthetic ShapeNet data with realistic backgrounds, the approach achieves high-quality unseen-view predictions and 3D reconstructions, and generalizes to real images, outperforming nearest-neighbor baselines and prior deep-learning methods. The study also analyzes view-dependence, latent-space interpolation, and internal representations, highlighting the practical potential for single-image 3D reasoning in applications such as AR/VR and robotics.

Abstract

We present a convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object. Concretely, the network can predict an RGB image and a depth map of the object as seen from an arbitrary view. Several of these depth maps fused together give a full point cloud of the object. The point cloud can in turn be transformed into a surface mesh. The network is trained on renderings of synthetic 3D models of cars and chairs. It successfully deals with objects against cluttered backgrounds and generates reasonable predictions for real images of cars.
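
The fusion step mentioned in the abstract (several depth maps combined into one point cloud) can be illustrated with a minimal sketch. The excerpt does not give the camera model or the fusion procedure, so everything here is an assumption made for illustration: a pinhole camera with focal length `focal` in pixels, a camera orbiting the object's vertical axis at distance `cam_dist` while looking at the object center, and the hypothetical helper names `backproject`, `camera_to_object`, and `fuse`. The subsequent mesh optimization is not covered.

```python
import numpy as np


def backproject(depth, focal):
    """Lift an HxW depth map to an (N, 3) array of camera-frame points.

    Assumes a pinhole camera with the principal point at the image center.
    Pixels with non-positive or non-finite depth are treated as background.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    valid = (np.isfinite(depth) & (depth > 0)).reshape(-1)
    return points[valid]


def camera_to_object(points, azimuth_deg, cam_dist):
    """Move camera-frame points into a shared object-centered frame.

    Assumes the camera looks at the object center from distance cam_dist
    and that views differ only by a rotation about the vertical (y) axis.
    """
    t = np.deg2rad(azimuth_deg)
    rot = np.array([
        [np.cos(t), 0.0, np.sin(t)],
        [0.0, 1.0, 0.0],
        [-np.sin(t), 0.0, np.cos(t)],
    ])
    centered = points - np.array([0.0, 0.0, cam_dist])
    return centered @ rot.T


def fuse(depth_maps, azimuths_deg, focal, cam_dist):
    """Concatenate the back-projected points of all views into one cloud."""
    clouds = [
        camera_to_object(backproject(d, focal), a, cam_dist)
        for d, a in zip(depth_maps, azimuths_deg)
    ]
    return np.concatenate(clouds, axis=0)


# Usage with two synthetic 3x3 depth maps (placeholder values, not paper data).
views = [np.full((3, 3), 2.0), np.full((3, 3), 2.0)]
cloud = fuse(views, azimuths_deg=[0.0, 90.0], focal=1.0, cam_dist=2.0)
```

In this sketch each view contributes its own points unchanged; a practical pipeline would likely also filter outliers or merge near-duplicate points before meshing.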

Paper Structure

This paper contains 24 sections, 4 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: Our network infers an object's 3D representation from a single input image. It then predicts unseen views of this object and their depth maps. Multiple such views are fused into a full 3D point cloud, which is further optimized to obtain a mesh.
  • Figure 2: The architecture of our network. The encoder (blue) turns an input image into an abstract 3D representation. The decoder (green) processes the angle, modifies the encoded hidden representation accordingly, and renders the final image together with the depth map.
  • Figure 3: Train-test split of cars. Sample renderings and their nearest neighbors are shown. Each row shows on the left a rendering of a query model from the test set together with several HOG space nearest neighbors from the training set. The two query models on the right are 'difficult' ones.
  • Figure 4: Predictions of the network (top row for each model) and the corresponding ground truth images (bottom row for each model). The input to the network is in the leftmost column for each model. The top right model is a "difficult" car.
  • Figure 5: Depth map predictions (top row) and the corresponding ground truth (bottom row). The network correctly estimates the shape.
  • ...and 12 more figures