LN3DIFF++: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

Yushi Lan, Fangzhou Hong, Shangchen Zhou, Shuai Yang, Xuyi Meng, Yongwei Chen, Zhaoyang Lyu, Bo Dai, Xingang Pan, Chen Change Loy

TL;DR

LN3Diff++ introduces a scalable 3D diffusion pipeline that operates in a compact 3D-aware latent space learned by a VAE to enable fast, conditional 3D generation across categories. A 3D-aware transformer-based decoder maps latent codes to high-capacity 3D neural fields, while diffusion training occurs in this latent space using a DiT-based denoiser and flexible conditioning (text/image, with DINO/CLIP features). The approach delivers state-of-the-art performance on ShapeNet for 3D generation and strong monocular reconstruction across ShapeNet, FFHQ, and Objaverse, with significantly faster inference than prior latent-free methods. Ablation studies confirm the importance of the 3D latent design, novel-view supervision for monocular data, and conditioning strategies for fidelity and controllability. The work demonstrates a practical pathway to generic, high-quality 3D generation suitable for broader 3D vision and graphics tasks, while acknowledging limitations in memory usage and potential improvements in explicit 3D representations and compositionality.

Abstract

The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff++ to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.
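
As an illustration of the VAE regularization the abstract refers to, the following is a minimal, generic sketch of the KL term and reparameterized sampling for a diagonal Gaussian posterior. It is not the authors' implementation: the diagonal Gaussian form, the function names, and the `mu`/`logvar` parameterization are all assumptions made for illustration, and a real latent would be a multi-channel tensor rather than a flat list.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.

    Assumption: diagonal Gaussian posterior, as in a standard VAE.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


# Hypothetical usage: a 4-dimensional latent code.
rng = random.Random(0)
mu = [0.5, -0.2, 0.0, 1.0]
logvar = [0.0, -1.0, 0.5, -2.0]
z = reparameterize(mu, logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
```

The KL term is zero exactly when the posterior equals the standard normal prior, which is what keeps the latent space compact enough for a diffusion model to be trained on it afterwards.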

Paper Structure (21 sections, 6 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: We present LN3Diff++, which performs efficient 3D diffusion learning over a compact latent space. Compared to LN3Diff which adopts NeRF rendering and supports text-conditioned 3D generation, LN3Diff++ further enables SDF-based 3D representation and image-conditioned 3D generation. The resulting model enables both high-quality monocular 3D reconstruction and text-to-3D synthesis.
  • Figure 2: Pipeline of LN3Diff++. In the 3D latent space learning stage, a convolutional encoder $\mathcal{E}_{\boldsymbol{\phi}}$ encodes a set of images $\mathcal{I}$ into the KL-regularized latent space. The encoded 3D latent is further decoded by a 3D-aware DiT transformer $\mathcal{D}_T$, in which we perform self-plane attention and cross-plane attention. The transformer-decoded latent is up-sampled by a convolutional upsampler $\mathcal{D}_U$ into a high-resolution tri-plane for rendering supervision. In the next stage, we perform conditional diffusion learning over the compact latent space using either U-Net or DiT. The detailed architecture of DiT is shown in Fig. 3.
  • Figure 3: Diffusion training of LN3Diff++. We adopt the DiT architecture with AdaLN-single [chen2023pixartalpha] and QK-Norm [megavitesser2020taming]. For both conditioning modalities, we incorporate the conditional features using attention mechanisms. Specifically, for CLIP-based conditioning, we employ cross-attention blocks to inject the condition, following the approach used in PixArt-α [chen2023pixartalpha]. For image-conditioned 3D generation, we additionally concatenate DINO patch features into the self-attention block.
  • Figure 4: ShapeNet Unconditional Generation. We show four samples for each method. Zoom in for the best view.
  • Figure 5: ShapeNet Conditional Generation. We show conditional generation with both text and image inputs. Zoom in for the best view.
  • ...and 10 more figures
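
The QK-Norm mentioned in the Figure 3 caption can be illustrated with a small single-head attention sketch. This is a generic example of one common variant (L2-normalizing queries and keys, then applying a fixed scale), not the authors' code: the function names, the choice of L2 normalization over a learned normalization layer, and the `scale` value are assumptions made for illustration.

```python
import math


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def qk_norm_attention(queries, keys, values, scale=10.0):
    """Single-head attention with normalized queries and keys.

    Assumption: `scale` stands in for the temperature that would
    usually be learned; 10.0 is an arbitrary illustrative value.
    """
    q_unit = [l2_normalize(q) for q in queries]
    k_unit = [l2_normalize(k) for k in keys]
    dim_v = len(values[0])
    outputs = []
    for q in q_unit:
        # Each logit is a scaled cosine similarity, so it lies in [-scale, scale].
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in k_unit]
        weights = softmax(logits)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_v)]
        )
    return outputs


# Hypothetical usage: two tokens with 2-dimensional features.
queries = [[1.0, 0.0], [0.0, 2.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
attended = qk_norm_attention(queries, keys, values)
```

Because the logits depend only on the direction of each query and key, rescaling the inputs leaves the attention weights essentially unchanged, which is the property that makes this kind of normalization useful for stabilizing transformer training.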