Large Point-to-Gaussian Model for Image-to-3D Generation

Longfei Lu, Huachen Gao, Tao Dai, Yaohua Zha, Zhi Hou, Junta Wu, Shu-Tao Xia

TL;DR

This work tackles image-to-3D generation by introducing a Point-to-Gaussian framework that converts a geometry-prior point cloud, produced by a large diffusion model conditioned on a single image, into explicit 3D Gaussian parameters for rendering. The core innovation is the APP block (Attention, Projection, Point feature extractor) which fuses 2D image features with 3D point-cloud features across multiple scales, enabling robust cross-modality learning. A multi-scale Gaussian decoder and a point-cloud upsampler produce high-fidelity Gaussians that, when rendered with Gaussian splatting, achieve strong qualitative and quantitative performance on Objaverse and GSO datasets, while training on a relatively small dataset. The method speeds up inference after the initial point-cloud stage and demonstrates strong generalization and view-consistency, highlighting its practical potential for rapid, high-quality 3D asset generation from images.

Abstract

Recently, image-to-3D approaches built on large reconstruction models, particularly 3D Gaussian reconstruction models, have significantly advanced the quality and speed of 3D asset generation. Existing large 3D Gaussian models map 2D images directly to 3D Gaussian parameters, but regressing 3D Gaussian representations from 2D images is challenging without 3D priors. In this paper, we propose a large Point-to-Gaussian model for image-to-3D generation that takes as input an initial point cloud, produced by a large 3D diffusion model conditioned on the 2D image, and generates the Gaussian parameters. The point cloud provides an initial 3D geometry prior for Gaussian generation, thus significantly facilitating image-to-3D generation. Moreover, we present the APP block, consisting of an Attention mechanism, a Projection mechanism, and a Point feature extractor, for fusing image features with point cloud features. Extensive qualitative and quantitative experiments on the GSO and Objaverse datasets demonstrate the effectiveness of the proposed approach and show that it achieves state-of-the-art performance.
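
To make "generating the Gaussian parameters" from a point cloud concrete, the following is a minimal sketch of decoding one fused per-point feature vector into the parameters of a single 3D Gaussian using separate linear heads. The head names, the feature size, and the activations (exponential for scale, sigmoid for opacity and color, normalized quaternion for rotation) are assumptions borrowed from common Gaussian-splatting practice, not details confirmed by the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, bias, feature):
    """Plain affine map: one output per weight row."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weights, bias)]

def decode_gaussian(point, feature, heads):
    """Turn one point and its feature vector into Gaussian parameters.

    `heads` maps a (hypothetical) head name to a (weights, bias) pair.
    """
    offset = linear(*heads["offset"], feature)
    raw_scale = linear(*heads["scale"], feature)
    raw_rot = linear(*heads["rotation"], feature)
    raw_opacity = linear(*heads["opacity"], feature)
    raw_color = linear(*heads["color"], feature)
    # Guard against a zero quaternion before normalizing.
    norm = math.sqrt(sum(q * q for q in raw_rot)) or 1.0
    return {
        # The input point acts as the geometry prior; the head only refines it.
        "mean": [p + o for p, o in zip(point, offset)],
        "scale": [math.exp(s) for s in raw_scale],      # strictly positive
        "rotation": [q / norm for q in raw_rot],        # unit quaternion
        "opacity": sigmoid(raw_opacity[0]),             # in (0, 1)
        "color": [sigmoid(c) for c in raw_color],       # RGB in (0, 1)
    }

# Toy heads for a 2-dimensional feature: zero weights, hand-picked biases.
zeros = lambda rows: [[0.0, 0.0] for _ in range(rows)]
heads = {
    "offset": (zeros(3), [0.1, 0.0, 0.0]),
    "scale": (zeros(3), [0.0, 0.0, 0.0]),
    "rotation": (zeros(4), [2.0, 0.0, 0.0, 0.0]),
    "opacity": (zeros(1), [0.0]),
    "color": (zeros(3), [0.0, 0.0, 0.0]),
}
gaussian = decode_gaussian([1.0, 2.0, 3.0], [0.5, -0.5], heads)
```

Predicting an offset from the input point, rather than an absolute position, is one simple way to let the point cloud carry the geometry prior while the network only refines it.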

Paper Structure

This paper contains 29 sections, 10 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Illustration of our full pipeline. Given a single-view image as input, the corresponding point cloud is first generated by the pre-trained 3D diffusion model Shap-E. The point cloud is then fed to our proposed Point-to-Gaussian Generator, and the input image is introduced as a complementary condition to the Gaussian generator.
  • Figure 2: Detailed architecture of the Point-to-Gaussian Generator. Given the input point cloud and the corresponding image, a point cloud upsampler is first applied to increase the number of 3D points, followed by an encoder consisting of several APP Down Blocks that extract multi-scale point cloud features. Each block contains a point feature extractor, projection, and attention for cross-modality feature enhancement. The point cloud features are decoded by a series of APP Up Blocks and Multi Linear Heads to obtain the final 3D Gaussians, and novel-view images are then rendered by conventional Gaussian splatting.
  • Figure 3: Visualization of pose-aware projection. The red triangle and the yellow star mark corresponding positions during projection. For a self-occluded view, the image is projected only onto the corresponding (visible) part of the point cloud.
  • Figure 4: Qualitative comparison with baselines for single-view image-to-3D reconstruction on the Objaverse and Google Scanned Objects (GSO) datasets. Our method outperforms previous Gaussian-based baselines in both geometry and texture, and in particular generates more consistent multi-view renderings, thanks to the geometry prior of the point cloud input.
  • Figure 5: Visualization of the effect of each component of the proposed APP block. "w/o. pro" denotes removing the projection mechanism, and "w/o. ca" denotes removing the cross-attention mechanism. "w/o. ca and pro" denotes an APP block that contains only the point feature extractor.
  • ...and 3 more figures
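
The pose-aware projection described in the Figure 3 caption can be illustrated with a minimal sketch: project each 3D point into the image with a known camera pose, keep only the points that are not self-occluded, and attach the image feature at the projected pixel. The pinhole camera model, the z-buffer visibility test, the nearest-pixel lookup, and all function names here are assumptions for illustration; the paper's actual projection may differ.

```python
def project_points(points, R, t, focal, width, height):
    """Project world-space points to pixels with a pinhole camera.

    R (3x3 nested lists) and t (length 3) map world to camera coordinates,
    with the camera looking down +z. Returns (u, v, depth) per point, or
    None if the point is behind the camera or outside the image.
    """
    projected = []
    for p in points:
        x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        if z <= 1e-8:
            projected.append(None)
            continue
        u = focal * x / z + width / 2.0
        v = focal * y / z + height / 2.0
        projected.append((u, v, z) if 0 <= u < width and 0 <= v < height else None)
    return projected

def visible_mask(projected, depth_tol=1e-2):
    """Z-buffer test: a point is visible if it is (nearly) the closest in its pixel."""
    nearest = {}
    for proj in projected:
        if proj is not None:
            key = (int(proj[0]), int(proj[1]))
            nearest[key] = min(proj[2], nearest.get(key, proj[2]))
    return [proj is not None
            and proj[2] <= nearest[(int(proj[0]), int(proj[1]))] + depth_tol
            for proj in projected]

def sample_features(feature_map, projected, mask):
    """Nearest-pixel feature lookup for visible points; zeros for the rest."""
    channels = len(feature_map[0][0])
    feats = []
    for proj, vis in zip(projected, mask):
        if vis:
            feats.append(list(feature_map[int(proj[1])][int(proj[0])]))
        else:
            feats.append([0.0] * channels)
    return feats

# Toy example: identity pose, 4x4 single-channel feature map.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 3.0), (0.5, 0.0, 2.0)]
feature_map = [[[float(v * 4 + u)] for u in range(4)] for v in range(4)]
projected = project_points(points, identity, [0.0, 0.0, 0.0], focal=4.0, width=4, height=4)
mask = visible_mask(projected)
feats = sample_features(feature_map, projected, mask)
```

In the toy example the second point lies directly behind the first along the same camera ray, so it is treated as self-occluded and receives no image feature, which matches the behavior the caption describes.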