What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs

Alex Trevithick, Matthew Chan, Towaki Takikawa, Umar Iqbal, Shalini De Mello, Manmohan Chandraker, Ravi Ramamoorthi, Koki Nagano

TL;DR

This work tackles the memory bottleneck of neural volume rendering in 3D GANs by enabling full-resolution pixel rendering with a learned per-ray sampler, achieving strict view-consistency and unprecedented geometric detail without 2D super-resolution. It introduces an SDF-based VolSDF representation with spatially varying surface tightness and a high-resolution proposal network that predicts high-resolution sampling distributions from a cheap low-resolution probe. The method leverages robust, stratified sampling and regularization to render with as few as $20$ samples per ray, matching SR-based baselines in image quality while surpassing prior methods in geometric accuracy, demonstrated on FFHQ and AFHQ. This approach advances unsupervised learning of detailed 3D shapes from in-the-wild 2D images, enabling high-fidelity 3D content and novel view synthesis without explicit 3D supervision.

Abstract

3D-aware Generative Adversarial Networks (GANs) have shown remarkable progress in learning to generate multi-view-consistent images and 3D geometries of scenes from collections of 2D images via neural volume rendering. Yet, the significant memory and computational costs of dense sampling in volume rendering have forced 3D GANs to adopt patch-based training or employ low-resolution rendering with post-processing 2D super resolution, which sacrifices multiview consistency and the quality of resolved geometry. Consequently, 3D GANs have not yet been able to fully resolve the rich 3D geometry present in 2D images. In this work, we propose techniques to scale neural volume rendering to the much higher resolution of native 2D images, thereby resolving fine-grained 3D geometry with unprecedented detail. Our approach employs learning-based samplers for accelerating neural rendering for 3D GAN training using up to 5 times fewer depth samples. This enables us to explicitly "render every pixel" of the full-resolution image during training and inference without post-processing super resolution in 2D. Together with our strategy to learn high-quality surface geometry, our method synthesizes high-resolution 3D geometry and strictly view-consistent images while maintaining image quality on par with baselines relying on post-processing super resolution. We demonstrate state-of-the-art 3D geometric quality on FFHQ and AFHQ, setting a new standard for unsupervised learning of 3D shapes in 3D GANs.

Paper Structure (25 sections, 12 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Left: Our results. The split view in the middle demonstrates the high degree of agreement between our 2D rendering and corresponding 3D geometry. Our method can learn fine-grained 3D details (e.g., eyeglass frame and cat's fur) that are geometrically well-aligned to 2D images without multiview or 3D scan data. Right: Comparison with EG3D [eg3d2022]. Our tight SDF prior provides smooth and detailed surfaces on the face and hat while EG3D exhibits geometry artifacts and discrepancies between geometry and rendering. Please see Fig. \ref{fig:result} and the accompanying video for more examples, and Fig. \ref{fig:comparison} for comparison to other baselines.
  • Figure 2: Samples from EG3D 256 model. Right: Volume rendering with 48 coarse samples and 48 fine samples per ray with two-pass importance sampling [mildenhall2020nerf] results in undersampling, leading to noticeable noisy artifacts. Left: These artifacts are repaired by super resolution (SR). An unsharp mask has been applied to the zoomed views for presentation purposes.
  • Figure 3: Here we show our proposed pipeline and its intermediate outputs. Beginning from the triplane $T$, we trace uniform samples to probe the scene, yielding low-resolution $I_{128}$ and weights $P_{128}$. These are fed to a CNN which produces high-resolution proposal weights $\hat{P}_{512}$ (weights are visualized as uniform level sets). We perform robust sampling and volume render to get the final image $I_{512}$ and the surface variance $B$.
  • Figure 4: We visualize the volume rendering PDFs for the green pixel in the images on the right along with sampling methods. The ground truth distribution in blue is bimodal due to the discontinuous depth. Without stratification, the samples from the predicted yellow PDF completely miss the second mode. Stratification reduces the variance, yet also misses the second mode. Our robust stratified samples hit both modes despite the inaccurate predictions. The supervision PDF is visualized in purple as well.
  • Figure 5: Curated samples on FFHQ and AFHQ. Our method can resolve high-fidelity geometry (e.g., eyeglasses) and fine-grained details (e.g., stubble hair and cat's fur) as seen in the geometry and normal map.
  • ...and 3 more figures
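
The Figure 4 caption contrasts plain stratified sampling from a predicted PDF with "robust" stratified sampling that still reaches a mode the prediction missed. The exact robust sampling procedure is not given in this summary, so the following is only a minimal sketch under stated assumptions: the per-ray PDF is piecewise constant over equal-width depth bins, samples are drawn by stratified inverse-CDF sampling, and robustness is approximated by blending the predicted PDF with a uniform PDF. The function name `stratified_inverse_cdf` and the `uniform_mix` parameter are hypothetical and are not taken from the paper.

```python
import bisect
import random


def stratified_inverse_cdf(weights, n_samples, uniform_mix=0.0, rng=None):
    """Draw n_samples depths in [0, 1) from a piecewise-constant per-ray PDF.

    weights: non-negative per-bin weights along the ray (need not sum to 1).
    uniform_mix: hypothetical blend factor in [0, 1] (an assumption, not the
        paper's method); 0 keeps the predicted PDF, 1 samples uniformly.
    """
    rng = rng or random.Random(0)
    n_bins = len(weights)
    total = sum(weights)
    if total <= 0:
        # Degenerate prediction: fall back to a uniform PDF.
        pdf = [1.0 / n_bins] * n_bins
    else:
        # Normalize, then blend with uniform so every bin keeps some mass.
        pdf = [(1.0 - uniform_mix) * (w / total) + uniform_mix / n_bins
               for w in weights]

    # CDF evaluated at the bin edges: cdf[0] = 0 and cdf[-1] = 1.
    cdf = [0.0]
    for p in pdf:
        cdf.append(cdf[-1] + p)
    cdf[-1] = 1.0

    samples = []
    for i in range(n_samples):
        # Stratification: exactly one uniform variate per stratum of [0, 1).
        u = (i + rng.random()) / n_samples
        # Locate the bin b with cdf[b] <= u < cdf[b + 1].
        b = min(bisect.bisect_right(cdf, u) - 1, n_bins - 1)
        width = cdf[b + 1] - cdf[b]
        frac = (u - cdf[b]) / width if width > 0 else 0.5
        # Map (bin index, position inside bin) back to a depth in [0, 1).
        samples.append((b + frac) / n_bins)
    return samples


# Toy predicted PDF that puts all mass in bin 2 and misses a mode in bin 6.
predicted = [0, 0, 1, 0, 0, 0, 0, 0]
plain = stratified_inverse_cdf(predicted, 16)
robust = stratified_inverse_cdf(predicted, 16, uniform_mix=0.5)
```

In this toy setup, `plain` places every sample inside bin 2, whereas `robust` also places samples in the bins the prediction assigned zero weight, including bin 6. The blend factor trades sample efficiency for coverage; how the actual method balances the two is not specified here.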