Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu

Abstract

Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors remain underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces the pure-noise initialization of the sparse structure latent with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to teach global structural inpainting, is then used at inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point cloud priors for more accurate and structurally controllable 3D generation.
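
As a rough illustration of the input formulation mentioned in the abstract, the sketch below voxelizes a visible-region point cloud, marks the observed voxels in a mask, fills the unobserved voxels with Gaussian noise, and stacks the mask as an extra channel. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the helper name `build_inpainting_input`, the 16^3 resolution, the [-0.5, 0.5]^3 normalization, and the use of raw occupancy in place of the VAE-encoded sparse structure latent are all hypothetical.

```python
import numpy as np

def build_inpainting_input(points, resolution=16, seed=0):
    """Hypothetical helper: visible points in [-0.5, 0.5]^3 -> (2, R, R, R) input.

    Channel 0 holds known values where the mask is 1 and noise elsewhere;
    channel 1 is the known-region mask.
    """
    rng = np.random.default_rng(seed)

    # Voxelize: map each point to an integer cell index on the grid.
    idx = np.floor((points + 0.5) * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)

    # Mask of observed (visible) voxels.
    mask = np.zeros((resolution,) * 3, dtype=np.float32)
    mask[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0

    # Assumption: raw occupancy stands in for the VAE-encoded latent.
    latent = mask.copy()

    # Fill unobserved regions with random noise, keep observed ones as-is.
    noise = rng.standard_normal(latent.shape).astype(np.float32)
    filled = mask * latent + (1.0 - mask) * noise

    # Concatenate the mask as an extra channel.
    return np.stack([filled, mask], axis=0), mask

points = np.array([[0.0, 0.0, 0.0], [0.2, -0.1, 0.3], [-0.45, 0.45, 0.0]])
x, mask = build_inpainting_input(points)
```

Here `x` has one value channel and one mask channel; in the actual method the masking would presumably operate on the encoded latent rather than on raw occupancy.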

Paper Structure (23 sections, 6 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Overall framework. Given point cloud priors, either pre-existing or predicted by VGGT from an input image, we first voxelize and VAE-encode them to obtain an SS latent; the empty regions are filled with random noise, and the result is concatenated with an extracted mask to form the model input. During training, this input is fed into our inpainting flow transformer $\mathcal{G}_{inp}$, which is optimized via a conditional flow matching loss. During inference, the test input is processed by the trained $\mathcal{G}_{inp}$ through a two-stage sampling procedure: (1) a structural inpainting stage with $s$ sampling steps that inpaints the global structure, and (2) a boundary refinement stage with the remaining $(t-s)$ steps that refines the inpainting boundaries, yielding the final output SS latent.
  • Figure 2: Training data processing. We preserve the visible portion of the complete point cloud and convert it into training inputs.
  • Figure 3: Single-object generation on Toys4K. For the explicit point cloud prior results, we use point clouds extracted strictly from the visible region of the input images, whereas the "VGGT-estimated" results use point clouds inferred from the condition images by VGGT.
  • Figure 4: Scene-level generation on 3D-FRONT. The input point cloud prior setting is the same as in Fig. 3.
  • Figure 5: Ablation study. Allocating all sampling steps to inpainting (Inp.) leaves geometric “holes” along the inpainting edge.
  • ...and 6 more figures
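
The two-stage sampling procedure in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch only, resting on several assumptions that the captions do not confirm: that the inpainting stage re-imposes the known region after every step while the refinement stage lets all voxels move, that sampling is plain Euler integration from time 1 to 0, and that a toy velocity field can stand in for the trained $\mathcal{G}_{inp}$. The names `two_stage_sample`, `velocity_fn`, `total_steps` (the caption's $t$), and `inpaint_steps` (the caption's $s$) are hypothetical.

```python
import numpy as np

def two_stage_sample(x_init, known, mask, velocity_fn,
                     total_steps=25, inpaint_steps=20):
    """Euler sampler with an inpainting stage followed by a refinement stage."""
    x = x_init.copy()
    ts = np.linspace(1.0, 0.0, total_steps + 1)  # integrate from noise (1) to data (0)
    for i in range(total_steps):
        t, t_next = ts[i], ts[i + 1]
        x = x + (t_next - t) * velocity_fn(x, t)
        if i < inpaint_steps:
            # Stage 1 (assumed): keep the visible region fixed to the prior.
            x = mask * known + (1.0 - mask) * x
        # Stage 2 (assumed): no re-imposition, so boundaries can be refined.
    return x

# Toy stand-in for the trained network: a straight-line flow toward `target`.
target = np.ones((4, 4, 4))
def toy_velocity(x, t):
    return (x - target) / max(t, 1e-6)

rng = np.random.default_rng(0)
mask = np.zeros((4, 4, 4))
mask[:2] = 1.0                      # pretend the first half is visible
known = target * mask
x_init = mask * known + (1.0 - mask) * rng.standard_normal((4, 4, 4))

out = two_stage_sample(x_init, known, mask, toy_velocity)
inpaint_only = two_stage_sample(x_init, known, mask, toy_velocity,
                                total_steps=25, inpaint_steps=25)
```

Setting `inpaint_steps` equal to `total_steps` corresponds to the "Inp." ablation setting in Figure 5, where all sampling steps are allocated to inpainting.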