3D generation on ImageNet

Ivan Skorokhodov; Aliaksandr Siarohin; Yinghao Xu; Jian Ren; Hsin-Ying Lee; Peter Wonka; Sergey Tulyakov

TL;DR

This work introduces 3DGP, a 3D generator with Generic Priors that enables scalable 3D‑aware synthesis on non‑aligned, multi‑category datasets like ImageNet. It couples a learnable Ball‑in‑Sphere camera distribution, an adversarial depth supervision pipeline with a depth adaptor, and a discriminator knowledge distillation mechanism to guide geometry from imperfect depth predictions. The approach yields improved texture and geometry over state‑of‑the‑art 3D GANs on SDIP Dogs, SDIP Elephants, LSUN Horses, and ImageNet, and achieves more stable training with faster convergence. While still lagging behind 2D baselines in raw visual quality and facing geometry evaluation challenges, 3DGP represents a practical step toward 3D‑aware synthesis on in‑the‑wild, multi‑category data; code and data will be released.

Abstract

Existing 3D-from-2D generators are typically designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location, and orientation, and the camera always points to the center of the scene. This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses. In this work, we develop a 3D generator with Generic Priors (3DGP): a 3D synthesis framework with more general assumptions about the training data, and show that it scales to very challenging datasets, like ImageNet. Our model is based on three new ideas. First, we incorporate an inaccurate off-the-shelf depth estimator into 3D GAN training via a special depth adaptation module to handle the imprecision. Second, we create a flexible camera model and a regularization strategy for it to learn its distribution parameters during training. Finally, we extend the recent ideas of transferring knowledge from pre-trained classifiers into GANs for patch-wise trained models by employing a simple distillation-based technique on top of the discriminator. This technique achieves more stable training than the existing methods and speeds up convergence by at least 40%. We explore our model on four datasets: SDIP Dogs 256x256, SDIP Elephants 256x256, LSUN Horses 256x256, and ImageNet 256x256, and demonstrate that 3DGP outperforms the recent state-of-the-art in terms of both texture and geometry quality. Code and visualizations: https://snap-research.github.io/3dgp.
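
The depth supervision described in the abstract can be made concrete with a small sketch. The captions of Figures 2, 4, and 5 state that the discriminator sees a 4-channel color-depth pair, that a fake sample pairs the rendered image with its (adapted) depth, and that the normalized rendered depth is used with probability $P(\bar{\bm d})$. The Python below is only an illustration under assumptions: the function names, the min-max normalization, and the uniform choice among adapted maps are hypothetical and are not taken from the released code.

```python
import random


def normalize_depth(depth):
    """Min-max normalize an H x W depth map (list of rows) to [0, 1].

    Assumption: min-max scaling; the source does not specify the normalization.
    """
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # guard against a constant depth map
    return [[(d - lo) / scale for d in row] for row in depth]


def choose_fake_depth(rendered, adapted_maps, p_rendered, rng):
    """Pick the depth channel for a fake sample.

    With probability p_rendered, return the normalized rendered depth;
    otherwise return one adapted map chosen uniformly at random
    (the uniform choice is an assumption made for this sketch).
    """
    if rng.random() < p_rendered:
        return normalize_depth(rendered)
    return rng.choice(adapted_maps)


def make_rgbd(rgb, depth):
    """Stack a 3 x H x W image and an H x W depth map into a 4 x H x W sample."""
    return [*rgb, depth]
```

A real sample would be built the same way with `make_rgbd(real_image, estimated_depth)`, where the depth comes from the off-the-shelf estimator, so that both kinds of sample share the same 4-channel layout.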

Paper Structure (26 sections, 10 equations, 17 figures, 7 tables)

Introduction
Related work
Method
Learnable "Ball-in-Sphere" camera distribution
Adversarial depth supervision
Knowledge distillation for Discriminator
Training
Experimental Results
3D generation for single category datasets
3D synthesis on ImageNet
Conclusion
Reproducibility Statement
Ethics Statement
Limitations
Implementation details
...and 11 more sections

Figures (17)

Figure 1: Selected samples from EG3D and our generator trained on ImageNet $256^2$. EG3D models the geometry in low resolution and renders either flat shapes (when trained with the default camera distribution) or repetitive "layered" ones (when trained with a wide camera distribution). In contrast, our model synthesizes the radiance field in the full dataset resolution and learns high-fidelity details during training. Zoom-in for a better view.
Figure 2: Model overview. Left: our tri-plane-based generator. To synthesize an image, we first sample camera parameters from a prior distribution and pass them to the camera generator. This gives the posterior camera parameters, used to render an image and its depth map. The depth adaptor mitigates the distribution gap between the rendered and the predicted depth. Right: our discriminator receives a 4-channel color-depth pair as an input. A fake sample consists of the RGB image and its (adapted) depth map. A real sample consists of a real image and its estimated depth. Our two-headed discriminator predicts adversarial scores and image features for knowledge distillation.
Figure 3: Camera model. (a) The conventional camera model is designed for aligned datasets and uses just 2 degrees of freedom. (b) The proposed "Ball-in-Sphere" parametrization has 4 additional degrees of freedom: the field of view and the look-at position. Variable parameters are shown in blue.
Figure 4: Depth adaptor. Left: An example of a real image with its depth estimated by LeReS. Note that the estimated depth has several artifacts. For example, the human legs are closer than the tail, eyes are spaced unrealistically, and far-away grass is predicted to be close. Middle: the depth adaptor, meant to bridge the domains of predicted and NeRF-rendered depth. Right: a generated image with its adapted depth maps obtained from different layers of the adaptor.
Figure 5: Qualitative multi-view comparisons. Left: samples from the models trained on single-category datasets with articulated geometry. Two views are shown for each sample. Middle: ablations following Tab. \ref{tab:2d-experiments}, where we change the probability of using the normalized rendered depth $P(\bar{\bm d})$. EG3D, EpiGRAF, and $P(\bar{\bm d})=0$ do not render realistic side views, due to the underlying flat geometry. Our full model instead generates realistic high-quality views on all the datasets. Right: randomly sampled real images. Zoom-in for greater detail.
...and 12 more figures
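
The "Ball-in-Sphere" parametrization from the Figure 3 caption can be sketched in a few lines: the camera origin lies on an outer sphere (the conventional 2 degrees of freedom), while the look-at point inside an inner ball and the field of view add the 4 extra degrees of freedom. The Python below is a minimal illustration, not the authors' implementation: the function names, the uniform priors, and the default ranges are hypothetical, and the learnable camera generator that turns prior samples into posterior camera parameters is omitted.

```python
import math
import random


def sample_camera(rng, sphere_radius=1.0, ball_radius=0.2,
                  fov_range_deg=(10.0, 30.0)):
    """Sample one camera with 6 degrees of freedom (all ranges are assumptions)."""
    # Camera origin on the outer sphere: yaw and pitch (2 DoF).
    yaw = rng.uniform(-math.pi, math.pi)
    pitch = rng.uniform(math.pi / 4, 3 * math.pi / 4)  # polar angle from the y axis
    origin = (
        sphere_radius * math.sin(pitch) * math.cos(yaw),
        sphere_radius * math.cos(pitch),
        sphere_radius * math.sin(pitch) * math.sin(yaw),
    )
    # Look-at point inside the inner ball (3 DoF), via rejection sampling
    # from the enclosing cube.
    while True:
        look_at = tuple(rng.uniform(-ball_radius, ball_radius) for _ in range(3))
        if math.sqrt(sum(c * c for c in look_at)) <= ball_radius:
            break
    # Field of view in degrees (1 DoF).
    fov_deg = rng.uniform(*fov_range_deg)
    return origin, look_at, fov_deg


def view_direction(origin, look_at):
    """Unit vector pointing from the camera origin to the look-at point."""
    d = [t - o for o, t in zip(origin, look_at)]
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)


def fov_to_focal(fov_deg, image_size):
    """Pinhole focal length in pixels for a given field of view."""
    return 0.5 * image_size / math.tan(math.radians(fov_deg) / 2)
```

Setting `ball_radius=0` and fixing the field of view recovers the conventional 2-degree-of-freedom model, in which the camera always points to the center of the scene.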
