Geometry aware 3D generation from in-the-wild images in ImageNet

Qijia Shen, Guangrun Wang

TL;DR

This work tackles the challenge of learning 3D generative models from in-the-wild 2D images without camera pose annotations, using a triplane-based geometry representation and a strengthened StyleGAN2 backbone. It introduces multi-view discrimination to stabilize GAN training on diverse data and enables rendering from arbitrary viewpoints, achieving class-conditioned 3D generation directly from ImageNet and validating on ShapeNet as well. A pivotal component is the two-stage PTI-based single-view reconstruction, enabling efficient 3D shape completion from a single image. Overall, the method demonstrates significant quantitative and qualitative improvements over prior 3D-aware GANs and provides a scalable pathway toward large-scale 3D generation from web-scale datasets.

Abstract

Generating accurate 3D models is a challenging problem that traditionally requires explicit learning from 3D datasets using supervised learning. Although recent advances have shown promise in learning 3D models from 2D images, these methods often rely on well-structured datasets with multi-view images of each instance or camera pose information. Furthermore, these datasets usually contain clean backgrounds with simple shapes, making them expensive to acquire and hard to generalize from, which limits the applicability of these methods. To overcome these limitations, we propose a method for reconstructing 3D geometry from the diverse and unstructured ImageNet dataset without camera pose information. We use an efficient triplane representation to learn 3D models from 2D images and modify the architecture of the generator backbone based on StyleGAN2 to adapt to the highly diverse dataset. To prevent mode collapse and improve training stability on diverse data, we propose to use multi-view discrimination. The trained generator can produce class-conditional 3D models as well as renderings from arbitrary viewpoints. The class-conditional generation results demonstrate significant improvement over the current state-of-the-art method. Additionally, using PTI, we can efficiently reconstruct the whole 3D geometry from single-view images.

Paper Structure (16 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Selected examples of 3D models and their synthesized views. Trained on in-the-wild ImageNet images, our method successfully learns a class-conditional 3D generative model.
  • Figure 2: Structure of our method. We add some additional layers to the StyleGAN2 generator. The original StyleGAN2 is composed of sequential synthesis blocks, each of which doubles the resolution of the feature maps from the previous block (green blocks in the figure). After each original synthesis block, we add a synthesis block with the same structure but without the $2\times$ upsampling (orange blocks in the figure). We also increase the depth of the decoder by adding linear layers (purple blocks in the figure). The renderer synthesizes three views using camera poses sampled uniformly from a sphere.
  • Figure 3: Single-view image inversion is separated into two stages. The latent vector is optimized in the first stage and the generator is optimized in the second stage. The optimization goal is to minimize the distance between the generated image and the target image in both feature space and image space. The camera pose can be chosen arbitrarily, as long as it remains the same throughout the inversion process.
  • Figure 4: We present selected examples of 3D models generated using our method, along with three different views synthesized from each corresponding model. In this demonstration, we showcase six examples spanning from plants to architecture and everyday objects. These categories represent challenging cases for 3D generative models in the training set, with the Yellow Lady's Slipper having a highly complex background, the daisy having a limited view with a fixed angle, and the cheeseburger being deformed. However, our method has successfully learned the 3D models of these objects, demonstrating its potential and effectiveness.
  • Figure 5: Comparison of samples generated by EG3D and by our method. The five classes, from top to bottom, are parachute, Polaroid camera, wine bottle, yawl and tripod. Our method demonstrates a superior ability to learn the 3D shape of various categories, and it excels at handling challenging cases where multiple instances are present in the same image. This is particularly evident in the third row.
  • ...and 1 more figures
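
The Figure 2 caption states that the renderer synthesizes three views from camera poses sampled uniformly from a sphere. The sketch below shows one standard way to draw such poses; it is an illustration and not the authors' code. The function names (`sample_camera_position`, `look_at_origin`, `sample_views`), the `radius` parameter, and the assumptions that the full sphere is used and that every camera looks at the origin are all hypothetical and not taken from the paper.

```python
import math
import random


def sample_camera_position(radius=1.0, rng=random):
    """Draw one camera position uniformly (by area) on a sphere of the given radius."""
    # Sampling z uniformly in [-1, 1] and the azimuth uniformly in [0, 2*pi)
    # yields an area-uniform point on the unit sphere.
    z = rng.uniform(-1.0, 1.0)
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    r_xy = math.sqrt(max(0.0, 1.0 - z * z))
    return (radius * r_xy * math.cos(azimuth),
            radius * r_xy * math.sin(azimuth),
            radius * z)


def look_at_origin(position):
    """Return the unit viewing direction from the camera position towards the origin."""
    norm = math.sqrt(sum(c * c for c in position))
    return tuple(-c / norm for c in position)


def sample_views(num_views=3, radius=1.0, seed=None):
    """Sample `num_views` camera poses (assumed layout: position plus forward vector)."""
    rng = random.Random(seed)
    views = []
    for _ in range(num_views):
        position = sample_camera_position(radius, rng)
        views.append({"position": position, "forward": look_at_origin(position)})
    return views


if __name__ == "__main__":
    # Three views per generated object, matching the count given in the caption.
    for view in sample_views(num_views=3, radius=2.5, seed=0):
        print(view)
```

Sampling the polar angle uniformly instead of `z` would cluster cameras near the poles, which is why the sketch samples `z` directly.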