GaussianAnything: Interactive Point Cloud Flow Matching For 3D Object Generation

Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy

TL;DR

GaussianAnything introduces a point-cloud structured latent space $Z=[z_x \oplus z_h]$, pairing a sparse point cloud $z_x$ with per-point features $z_h$, learned by a 3D VAE from multi-view RGB-D-N renderings. The VAE encodes $V$ posed views into an unstructured set latent, projects it onto the point-cloud structured latent through cross-attention, and decodes that latent into high-quality surfel Gaussians. A two-stage cascaded diffusion model trained with flow matching then generates geometry first and texture second, conditioned on text via CLIP or on images via DINOv2. The approach outperforms existing native 3D methods on text- and image-conditioned 3D generation and enables geometry-texture disentanglement and interactive 3D editing, with near-complete Gaussian utilization during rendering. Overall, GaussianAnything provides a scalable, editable pathway for native 3D content generation with broad applicability to virtual reality, film, and design workflows.
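To make the latent layout and the flow-matching objective concrete, here is a minimal Python sketch. It is not the authors' code. It assumes that $z_x$ holds 3D point coordinates, that $z_h$ holds a small per-point feature vector, and that training uses the common linear interpolation path $x_t = (1-t)\,x_0 + t\,\epsilon$ with velocity target $\epsilon - x_0$; the summary does not state which path the paper uses. The function names `make_latent`, `split_latent`, and `flow_matching_pair` are hypothetical.

```python
import random


def make_latent(z_x, z_h):
    """Join per-point positions and features row by row: Z = [z_x (+) z_h]."""
    if len(z_x) != len(z_h):
        raise ValueError("z_x and z_h must describe the same number of points")
    return [list(p) + list(f) for p, f in zip(z_x, z_h)]


def split_latent(z, xyz_dim=3):
    """Inverse of make_latent: recover (z_x, z_h) from the joined latent."""
    return [row[:xyz_dim] for row in z], [row[xyz_dim:] for row in z]


def flow_matching_pair(x0, t, rng):
    """Build one training pair on the assumed linear path.

    x_t = (1 - t) * x0 + t * eps, with eps drawn from a standard Gaussian.
    Returns (x_t, v_target), where v_target = eps - x0 is the velocity a
    network would be regressed onto at time t.
    """
    x_t, v_target = [], []
    for row in x0:
        eps = [rng.gauss(0.0, 1.0) for _ in row]
        x_t.append([(1.0 - t) * a + t * e for a, e in zip(row, eps)])
        v_target.append([e - a for a, e in zip(row, eps)])
    return x_t, v_target
```

Because the two halves of the latent are stored side by side, the geometry $z_x$ can be read, generated, or edited without touching the features $z_h$, which is what makes the geometry-first, texture-second cascade possible.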

Abstract

While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent flow-based model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing native 3D methods in both text- and image-conditioned 3D generation.

Paper Structure

This paper contains 17 sections, 23 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Pipeline of the 3D VAE of GaussianAnything. In the 3D latent space learning stage, our proposed 3D VAE $\mathcal{E}_{\boldsymbol{\phi}}$ encodes $V$ views of posed RGB-D(epth)-N(ormal) renderings $\mathcal{R}$ into a point-cloud structured latent space. This is achieved by first processing the multi-view inputs into an unstructured set latent, which is then projected onto the 3D manifold through a cross-attention block, yielding the point-cloud structured latent code ${\mathbf{z}}$. The structured 3D latent is then decoded by a 3D-aware DiT transformer, giving the coarse Gaussian prediction. For high-quality rendering, the base Gaussians are further up-sampled by a series of cascaded upsamplers $\mathcal{D}_U^{k}$ into dense Gaussians for high-resolution rasterization. The 3D VAE training objective is detailed in Eq. (\ref{eq:stage1_loss}).
  • Figure 2: Diffusion training of GaussianAnything. Based on the point-cloud structured 3D VAE, we perform cascaded 3D diffusion learning given text (a) and image (b) conditions. We adopt the DiT architecture with AdaLN-single [chen2023pixartalpha] and QK-Norm [megavitesser2020taming]. For both condition modalities, we inject the conditional features through cross-attention blocks, but at different positions. The 3D generation is achieved in two stages (c), where a point cloud diffusion model first generates the 3D layout ${\mathbf{z}}_{x,0}$, and a texture diffusion model then generates the corresponding point-cloud features ${\mathbf{z}}_{h,0}$. The generated latent code ${\mathbf{z}}_0$ is decoded into the final 3D object with the pre-trained VAE decoder.
  • Figure 3: Qualitative Comparison of Image-to-3D. We showcase the novel-view 3D reconstruction of all methods given a single image from the unseen GSO dataset. Although feed-forward 3D reconstruction methods achieve sharper texture reconstruction, these methods fail to yield intact 3D predictions in challenging cases (e.g., the rhino in row 2). In contrast, our proposed native 3D diffusion model achieves consistently stable performance across all cases. Best viewed zoomed in.
  • Figure 4: Qualitative Comparison of Text-to-3D. We present text-conditioned 3D objects generated by GaussianAnything, displaying two views of each sample. The top section compares our results with baseline methods, while the bottom shows additional samples from our method along with their geometry maps. Our approach consistently yields better quality in terms of geometry, texture, and text-3D alignment.
  • Figure 5: 3D editing. Given two text prompts, we generate the corresponding point cloud ${\mathbf{z}}_{x,0}$ with the stage-1 diffusion model $\boldsymbol{\epsilon}_\Theta^{x}$, and the corresponding point cloud features ${\mathbf{z}}_{h,0}$ are then generated with $\boldsymbol{\epsilon}_\Theta^{h}$. As can be seen, the samples from stage-2 are consistent in overall 3D structure but have diverse textures. Thanks to the proposed Point Cloud-structured Latent space, our method supports interactive 3D structure editing. This is achieved by first modifying the stage-1 point cloud ${\mathbf{z}}_{x,0} \rightarrow {\mathbf{z}}_{x,0}^{\prime}$, and then regenerating the 3D object with the same Gaussian noise.
  • ...and 6 more figures
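
The two-stage sampling described in the Figure 2 caption and the editing loop described in the Figure 5 caption can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the trained networks $\boldsymbol{\epsilon}_\Theta^{x}$ and $\boldsymbol{\epsilon}_\Theta^{h}$ are replaced by a hypothetical closed-form velocity field (`toy_velocity`) that pulls samples toward a known target along a linear path, the CLIP and DINOv2 conditioning is omitted, and sampling uses plain Euler integration from $t=1$ (noise) to $t=0$ (data). All function names are illustrative.

```python
import random


def gaussian_noise(n_points, dim, seed):
    """Seeded Gaussian noise, so a stage can be rerun with the same noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]


def euler_sample(velocity, noise, steps=20):
    """Integrate dz/dt = velocity(z, t) from t=1 (noise) down to t=0 (data)."""
    z = [list(row) for row in noise]
    dt = 1.0 / steps
    for i in range(steps):
        t = (steps - i) / steps
        v = velocity(z, t)
        z = [[a - dt * b for a, b in zip(zr, vr)] for zr, vr in zip(z, v)]
    return z


def toy_velocity(target):
    """Stand-in for a trained network: exact velocity toward `target`
    on the linear path z_t = (1 - t) * target + t * eps (an assumption)."""
    def velocity(z, t):
        return [[(a - b) / t for a, b in zip(zr, tr)] for zr, tr in zip(z, target)]
    return velocity


def stage1_geometry(cond_points, seed):
    """Stage 1: sample the point cloud z_x (toy: the condition fixes the target)."""
    noise = gaussian_noise(len(cond_points), 3, seed)
    return euler_sample(toy_velocity(cond_points), noise)


def stage2_texture(z_x, seed, feat_dim=2):
    """Stage 2: sample per-point features z_h conditioned on the geometry z_x."""
    # Toy dependency on geometry: each feature is a multiple of the coordinate sum.
    target = [[sum(p) * (k + 1) for k in range(feat_dim)] for p in z_x]
    noise = gaussian_noise(len(z_x), feat_dim, seed)
    return euler_sample(toy_velocity(target), noise)


def generate(cond_points, seed=0):
    """Run the cascade: geometry first, then features given that geometry."""
    z_x = stage1_geometry(cond_points, seed)
    z_h = stage2_texture(z_x, seed + 1)
    return z_x, z_h


def edit_and_regenerate(z_x, offset, seed=0):
    """Shift the stage-1 points, then rerun stage 2 with the same noise seed."""
    z_x_edit = [[a + o for a, o in zip(p, offset)] for p in z_x]
    return z_x_edit, stage2_texture(z_x_edit, seed + 1)
```

Reusing the stage-2 seed after an edit means the only thing that changes between the two runs is the geometry, so differences in the regenerated features can be attributed to the edit alone. Changing the stage-2 seed while keeping $z_x$ fixed would instead correspond to sampling different textures for the same structure.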