GaussianAnything: Interactive Point Cloud Flow Matching For 3D Object Generation
Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy
TL;DR
GaussianAnything introduces a point-cloud-structured latent space $Z=[z_x \oplus z_h]$, learned by a 3D VAE that takes multi-view RGB-D-N renderings as input, and trains a cascaded diffusion model via flow matching to generate surfel Gaussians for 3D objects. The 3D VAE encodes $V$ input views, uses cross-attention to map the encoded features onto a sparse point-cloud latent, and decodes that latent into high-quality surfel Gaussians. The two-stage diffusion (geometry first, texture second) is conditioned on text via CLIP and on images via DINOv2. This approach yields state-of-the-art results on text- and image-conditioned 3D generation, enables geometry-texture disentanglement and interactive 3D editing, and achieves near-complete Gaussian utilization during rendering. Overall, GaussianAnything provides a scalable, editable pathway for native 3D content generation, with applications in virtual reality, film, and design workflows.
Abstract
While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent flow-based model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing native 3D methods in both text- and image-conditioned 3D generation.
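To make the latent layout and the geometry-first, texture-second cascade concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes $z_x$ holds per-point 3D positions of shape (N, 3), $z_h$ holds per-point features of shape (N, C), and $\oplus$ is channel-wise concatenation. It also uses the common linear-path form of flow matching, $x_t = (1-t)x_0 + t x_1$ with target velocity $x_1 - x_0$, and replaces the learned, CLIP- or DINOv2-conditioned velocity network with an oracle velocity. The function names `make_latent`, `flow_matching_pair`, and `euler_sample` are hypothetical.

```python
import numpy as np

def make_latent(z_x, z_h):
    """Assumed layout: concatenate positions (N, 3) and features (N, C) into Z of shape (N, 3 + C)."""
    assert z_x.shape[0] == z_h.shape[0], "one feature vector per latent point"
    return np.concatenate([z_x, z_h], axis=-1)

def flow_matching_pair(x1, rng):
    """Draw one training tuple on the linear path x_t = (1 - t) * x0 + t * x1.

    Returns the noisy sample x_t, the time t, and the regression target
    v = x1 - x0 that a velocity network would be trained to predict.
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint
    t = rng.uniform()                    # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, t, x1 - x0

def euler_sample(velocity_fn, x0, steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with explicit Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
N, C = 8, 4                               # toy sizes, chosen for illustration only
z_x_data = rng.standard_normal((N, 3))    # stand-in for "real" latent positions
z_h_data = rng.standard_normal((N, C))    # stand-in for "real" latent features

# Sanity check of the training target: moving x_t along v for the remaining time reaches x1.
x_t, t, v = flow_matching_pair(z_x_data, rng)
recovered = x_t + (1.0 - t) * v

# Stage 1 (geometry): sample positions. The oracle velocity stands in for a trained network.
noise_geo = rng.standard_normal((N, 3))
z_x = euler_sample(lambda x, t: z_x_data - noise_geo, noise_geo)

# Stage 2 (texture): sample features. A real model would also take z_x as conditioning here.
noise_tex = rng.standard_normal((N, C))
z_h = euler_sample(lambda x, t: z_h_data - noise_tex, noise_tex)

Z = make_latent(z_x, z_h)
print(Z.shape)
```

Because positions and features occupy separate channel blocks and are sampled in separate stages in this sketch, the position block could be edited or resampled without touching the feature block, which is one way to picture the geometry-texture disentanglement the abstract describes.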
