
G3DR: Generative 3D Reconstruction in ImageNet

Pradyumna Reddy, Ismail Elezi, Jiankang Deng

TL;DR

G3DR tackles single-view 3D reconstruction in large, diverse datasets by integrating a diffusion-based RGB-D generator with a triplane-based 3D decoder and a novel depth-regularization scheme to preserve geometry. It leverages CLIP for novel-view supervision and employs a multi-resolution sampling strategy to boost texture realism without increasing model size. The method achieves state-of-the-art geometry and competitive perceptual quality on ImageNet, with strong results across fine-grained datasets and text-conditioned generation, while significantly reducing training time. This approach offers a scalable path toward high-fidelity 3D content generation across broad categories for applications in VR/AR, gaming, and film production.

Abstract

We introduce a novel 3D generative method, Generative 3D Reconstruction (G3DR) in ImageNet, capable of generating diverse and high-quality 3D objects from single images, addressing the limitations of existing methods. At the heart of our framework is a novel depth regularization technique that enables the generation of scenes with high geometric fidelity. G3DR also leverages a pretrained language-vision model, such as CLIP, to enable reconstruction in novel views and improve the visual realism of generations. Additionally, G3DR designs a simple but effective sampling procedure to further improve the quality of generations. G3DR offers diverse and efficient 3D asset generation based on class or text conditioning. Despite its simplicity, G3DR is able to beat state-of-the-art methods, improving over them by up to 22% in perceptual metrics and 90% in geometry scores, while needing only half of the training time. Code is available at https://github.com/preddy5/G3DR

Paper Structure (13 sections, 6 equations, 6 figures, 6 tables)

This paper contains 13 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Our method is able to generate 3D images conditioned on latents (such as class), text, or images. All training was done on the ImageNet dataset, which contains only single-view images.
  • Figure 2: a) The architecture of our method. Our framework is conditioned on a visual input, class category, or text, and generates an image. It then feeds that image to a triplane generator and finally renders the result, using depth regularization to ensure good image quality and geometry; b) an illustration of our kernel in 2D: the blue line on the Depth Map represents the selected cross section, in the Original Gradients we visualize high-dimensional gradients using RGB channels, and the Scaled Gradients show how the kernel modifies the volume rendering function gradients; c) the losses of our model. In the canonical view, our method uses a combination of reconstruction, perceptual, and depth losses. In the novel view, it uses a combination of CLIP, perceptual, and TV losses. The losses are scaled accordingly, while the loss gradients during backpropagation are scaled based on the kernel in (b).
  • Figure 3: An illustration of our multi-resolution sampling. We observe how increasing the sampling level directly increases the reconstruction quality.
  • Figure 4: Qualitative results of our method. In the first row we present qualitative results generated on the ImageNet dataset and their corresponding depth. In the second row we show qualitative results on the fine-grained datasets with their corresponding depth.
  • Figure 5: Qualitative evaluations of our method and alternative depth supervision methods.
  • ...and 1 more figure
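
The Figure 2 caption describes two loss groups: reconstruction, perceptual, and depth terms in the canonical view, and CLIP, perceptual, and TV terms in the novel view, each scaled by a weight. The sketch below is only an illustration of that structure, not the authors' implementation. The function names, the weight values, and the choice of an anisotropic (absolute-difference) total-variation term on a depth map are all assumptions; the individual loss values would come from the model and are passed in as plain numbers here.

```python
# Illustrative sketch only: names, weights, and the TV formulation are
# assumptions, not taken from the G3DR code base.

def total_variation(depth):
    """Anisotropic TV of a 2D depth map given as a list of rows.

    Returns the mean absolute difference between vertically adjacent
    pixels plus the mean absolute difference between horizontally
    adjacent pixels. Requires at least a 2x2 map.
    """
    rows, cols = len(depth), len(depth[0])
    vertical = [abs(depth[r + 1][c] - depth[r][c])
                for r in range(rows - 1) for c in range(cols)]
    horizontal = [abs(depth[r][c + 1] - depth[r][c])
                  for r in range(rows) for c in range(cols - 1)]
    return sum(vertical) / len(vertical) + sum(horizontal) / len(horizontal)


def weighted_sum(terms, weights):
    """Combine named loss terms using per-term weights."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical weights; the summary above does not state the real values.
CANONICAL_WEIGHTS = {"reconstruction": 1.0, "perceptual": 1.0, "depth": 1.0}
NOVEL_WEIGHTS = {"clip": 1.0, "perceptual": 1.0, "tv": 0.1}

# A flat depth map has no variation; a horizontal ramp varies by 1 per pixel.
flat = [[1.0] * 4 for _ in range(4)]
ramp = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]

novel_terms = {"clip": 0.5, "perceptual": 0.2, "tv": total_variation(ramp)}
novel_loss = weighted_sum(novel_terms, NOVEL_WEIGHTS)
```

A smoothness term of this kind penalizes noisy depth in views that have no ground-truth image to compare against, which is one plausible reason to pair it with CLIP and perceptual terms in the novel view. The kernel-based gradient scaling from panel (b) is not modeled here.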