Efficient and Scalable Point Cloud Generation with Sparse Point-Voxel Diffusion Models

Ioannis Romanelis, Vlassios Fotis, Athanasios Kalogeras, Christos Alexakos, Konstantinos Moustakas, Adrian Munteanu

TL;DR

This work introduces SPVD, a sparse point-voxel diffusion U-Net that pairs a high-resolution point branch with a sparse voxel backbone for efficient and scalable 3D point cloud generation. Using GPU-based voxelization and a graph-structured integration of time embeddings, SPVD generates samples faster than prior diffusion models (its largest variant runs in approximately 70% of the time of PVD) while achieving state-of-the-art results among diffusion methods on ShapeNet. It also extends to conditional generation across all ShapeNet categories, implicit generation with fewer timesteps, shape completion, and super-resolution, making diffusion-based 3D generation more practical. The authors release a public implementation and discuss future directions, including latent diffusion pipelines and guidance, to broaden applicability to larger datasets such as Objaverse.

Abstract

We propose a novel point cloud U-Net diffusion architecture for 3D generative modeling capable of generating high-quality and diverse 3D shapes while maintaining fast generation times. Our network employs a dual-branch architecture, combining the high-resolution representations of points with the computational efficiency of sparse voxels. Our fastest variant outperforms all non-diffusion generative approaches on unconditional shape generation, the most popular benchmark for evaluating point cloud generative models, while our largest model achieves state-of-the-art results among diffusion methods, with a runtime approximately 70% of that of PVD, the previous state of the art. Beyond unconditional generation, we perform extensive evaluations, including conditional generation on all categories of ShapeNet, demonstrating the scalability of our model to larger datasets, and implicit generation, which allows our network to produce high-quality point clouds in fewer timesteps, further decreasing the generation time. Finally, we evaluate the architecture's performance in point cloud completion and super-resolution. Our model excels in all tasks, establishing it as a state-of-the-art diffusion U-Net for point cloud generative modeling. The code is publicly available at https://github.com/JohnRomanelis/SPVD.git.

Paper Structure (17 sections, 8 equations, 8 figures, 4 tables)

This paper contains 17 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The proposed Sparse Point-Voxel Diffusion (SPVD) is a novel diffusion architecture designed for efficient and scalable point cloud generation tasks. The generation process visualizes the gradual transformation of a noisy sample into a clean 3D shape. The completion and super-resolution tasks further demonstrate the capabilities of the proposed architecture.
  • Figure 2: Illustration of the forward and reverse diffusion processes in DDPM. Initially, a clean shape is progressively noisified through the forward diffusion process, resulting in increasingly noisy samples up to $\mathbf{x}_{T}$. These samples, generated via a predefined noise schedule, are utilized during the training phase. The reverse process, indicated by the arrows, involves a neural network tasked with estimating the inverse noise distribution to progressively denoise the samples, eventually reconstructing a clean shape $\mathbf{x}_{0}$.
  • Figure 3: Example architecture of the Sparse Point-Voxel U-Net. The initial point clouds are voxelized and sparse convolutions extract features incorporating neighborhood information. These features are propagated back to the point representation and are merged with the point features, extracted through shared-MLPs. This dual branch architecture is called Sparse Point-Voxel Block (SPVBlock). Note that, as shown, the voxel computations at each SPVBlock may vary, and the point branches in the encoder and decoder do not need to be symmetric. Additionally, we illustrate how sparse voxels and time embeddings can be linked as graph nodes to efficiently handle the varying number of sparse voxels in each point cloud in a batch.
  • Figure 4: Illustration of a Sparse Residual Convolutional Block. Time embedding information is integrated into the voxel features between two successive convolutional blocks. An optional attention block can further process the voxel features to incorporate global shape information.
  • Figure 5: Results of unconditional generation using our three model variants compared to PVD. While all models produce high-quality point clouds, our largest models can generate more unique shapes with coarser features, whereas PVD has lower shape diversity. For each model we report the generation time of a batch of 32 samples.
  • ...and 3 more figures
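
The captions for Figures 2 and 3 describe two mechanisms that a small example may help make concrete: noising a clean shape with the DDPM forward process, and linking each sparse voxel to the time embedding of its own point cloud when the number of occupied voxels differs across a batch. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The linear noise schedule, T = 1000, the voxel size, the sinusoidal embedding, and all function names (q_sample, voxelize, timestep_embedding) are illustrative choices that do not come from the source, and the batch-index gather is only one plausible reading of "linked as graph nodes".

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Forward diffusion (Figure 2); schedule values are assumptions ---
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative fraction of signal kept

def q_sample(x0, t, noise):
    """Closed-form forward step: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

# --- Sparse voxelization and per-voxel time embeddings (Figure 3) ---
def voxelize(points, batch_idx, voxel_size):
    """Return occupied voxels as rows (batch, i, j, k) and the point-to-voxel map."""
    grid = np.floor(points / voxel_size).astype(np.int64)
    keys = np.concatenate([batch_idx[:, None], grid], axis=1)
    voxels, point_to_voxel = np.unique(keys, axis=0, return_inverse=True)
    return voxels, point_to_voxel.reshape(-1)

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, shape (len(t), dim)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

# Two toy point clouds, each noised at its own timestep.
B, N = 2, 256
x0 = rng.uniform(-1.0, 1.0, size=(B, N, 3))
t = np.array([10, 900])
noise = rng.standard_normal(x0.shape)
xt = np.stack([q_sample(x0[b], t[b], noise[b]) for b in range(B)])

# Flatten the batch; the batch index is what keeps the clouds apart.
points = xt.reshape(-1, 3)
batch_idx = np.repeat(np.arange(B), N)
voxels, point_to_voxel = voxelize(points, batch_idx, voxel_size=0.1)

# Average the points that fall into each voxel (a simple voxel feature).
counts = np.bincount(point_to_voxel, minlength=len(voxels))
voxel_feat = np.zeros((len(voxels), 3))
np.add.at(voxel_feat, point_to_voxel, points)
voxel_feat /= counts[:, None]

# Each voxel picks up the time embedding of the cloud it belongs to,
# so no padding is needed when voxel counts differ per cloud.
t_emb = timestep_embedding(t, dim=16)   # (B, 16)
voxel_t_emb = t_emb[voxels[:, 0]]       # (num_voxels, 16)

# Propagate voxel features back to the points (devoxelization by lookup).
point_feat_from_voxels = voxel_feat[point_to_voxel]

print("voxels per cloud:", np.bincount(voxels[:, 0], minlength=B))
```

Because every voxel row carries its batch index, one gather serves the whole batch regardless of how many voxels each cloud occupies. How the gathered embedding is then combined with the voxel features is not specified in the captions above, so the sketch stops at the lookup.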