PointInfinity: Resolution-Invariant Point Diffusion Models

Zixuan Huang, Justin Johnson, Shoubhik Debnath, James M. Rehg, Chao-Yuan Wu

TL;DR

PointInfinity presents a resolution-invariant diffusion framework for RGB-D point clouds that trains at low resolution yet generates high-resolution outputs at test time. The core idea is a two-stream transformer with a fixed-size latent surface representation and a variable-sized data stream, enabling efficient training and scalable high-resolution generation without upsampling modules. Test-time resolution scaling yields higher surface fidelity and links to classifier-free guidance, achieving state-of-the-art results on CO3D with outputs up to 131k points while maintaining favorable compute and memory characteristics. This method significantly advances scalable, high-quality 3D point cloud generation and offers insights into the fidelity-variability trade-offs in diffusion-based generation.

Abstract

We present PointInfinity, an efficient family of point cloud diffusion models. Our core idea is to use a transformer-based architecture with a fixed-size, resolution-invariant latent representation. This enables efficient training with low-resolution point clouds, while allowing high-resolution point clouds to be generated during inference. More importantly, we show that scaling the test-time resolution beyond the training resolution improves the fidelity of generated point clouds and surfaces. We analyze this phenomenon and draw a link to classifier-free guidance commonly used in diffusion models, demonstrating that both allow trading off fidelity and variability during inference. Experiments on CO3D show that PointInfinity can efficiently generate high-resolution point clouds (up to 131k points, 31 times more than Point-E) with state-of-the-art quality.
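
For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of the commonly used formulation, in which a conditional and an unconditional noise prediction are extrapolated with a guidance weight. It is not code from the paper: the function name `guided_noise`, the parameter `w`, and the list-of-floats representation are illustrative assumptions. The abstract only claims that test-time resolution scaling offers an analogous fidelity-variability trade-off.

```python
def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance in its common form (an assumed convention):

        eps_guided = (1 + w) * eps_cond - w * eps_uncond

    w = 0 returns the conditional prediction unchanged; larger w pushes the
    prediction further away from the unconditional one, which typically
    trades sample variability for fidelity to the condition.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


# Toy per-coordinate noise predictions (made-up numbers for illustration).
eps_cond = [0.5, -1.0]
eps_uncond = [0.0, 0.0]

no_guidance = guided_noise(eps_cond, eps_uncond, w=0.0)
strong_guidance = guided_noise(eps_cond, eps_uncond, w=2.0)
print(no_guidance, strong_guidance)
```

Under this convention, the guidance weight plays the role that the test-time point count plays in the resolution-scaling analysis: both are inference-time knobs that require no retraining.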

Paper Structure (42 sections, 7 equations, 6 figures, 5 tables)

This paper contains 42 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: We present a resolution-invariant point cloud diffusion model that trains at low resolution (down to 64 points) but generates high-resolution point clouds (up to 131k points). This test-time resolution scaling improves our generation quality. We visualize our high-resolution 131k point clouds by converting them to a continuous surface.
  • Figure 2: Conditional 3D Point Cloud Generation with PointInfinity. (a): At the core of PointInfinity is a resolution-invariant conditional denoising model $\boldsymbol{\epsilon}_\theta$. It uses low-resolution point clouds for training and generates high-resolution point clouds at test time. (b): The main idea is a "Two-Stream" transformer design that decouples a fixed-size latent representation $\boldsymbol{z}$, which captures the underlying 3D shape, from a variable-size data representation $\boldsymbol{x}$, which models the point cloud space. 'Read' and 'write' cross-attention modules communicate between the two processing streams. Note that most of the computation happens in the latent stream, which models the underlying shape. This makes the model less susceptible to variations in point cloud resolution.
  • Figure 3: PointInfinity scales favorably compared to Point-E (Nichol et al., 2022) in both computation time and memory, for both training and inference. (a,b): Thanks to the resolution-invariant property of PointInfinity, the training iteration time and memory stay constant regardless of the test-time resolution $n_\mathrm{test}$. Point-E, on the other hand, requires $n_\mathrm{train} = n_\mathrm{test}$ and scales quadratically. (c,d): With our two-stream transformer design, inference time and memory scale linearly with respect to $n_\mathrm{test}$, while Point-E, with its vanilla transformer design, scales quadratically.
  • Figure 4: PointInfinity achieves favorable computational complexity even compared with implicit methods such as Shap-E (Jun and Nichol, 2023). The plots show that PointInfinity is faster and more memory-efficient than Shap-E at a high test-time resolution of 16k.
  • Figure 5: Qualitative Evaluation on the CO3D-v2 Dataset (Reizenstein et al., 2021). The point clouds generated by our model (columns d, e, f) represent denser and more faithful surfaces as resolution increases. In contrast, Point-E (columns a, b) does not capture fine details. In addition, PointInfinity obtains more accurate reconstructions from the 131k-resolution point clouds (column f) than MCC's surface reconstructions (column c).
  • ...and 1 more figure
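
The read/write design described in the Figure 2 caption and the scaling behavior described in the Figure 3 caption can be illustrated with a small sketch. This is a simplified toy under stated assumptions, not the authors' implementation: it uses single-head dot-product attention with no learned projections, no MLPs, no normalization, and no conditioning tokens, and every name (`cross_attend`, `two_stream_block`, `score_count_two_stream`, `score_count_vanilla`) is hypothetical. The point is only that the latent stream keeps a fixed size `m` while the data stream size `n` can change freely, and that the number of attention scores grows linearly in `n` for this layout versus quadratically for full self-attention over the data tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head dot-product attention without learned projections
    (a simplification made for this sketch)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return out


def add(a, b):
    """Residual connection: element-wise sum of two equally shaped token lists."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def two_stream_block(x, z):
    """x: n data tokens (n may vary); z: m latent tokens (m is fixed)."""
    z = add(z, cross_attend(z, x))  # read: latent queries attend over data tokens
    z = add(z, cross_attend(z, z))  # latent self-attention: cost does not depend on n
    x = add(x, cross_attend(x, z))  # write: data queries attend over latent tokens
    return x, z


def score_count_two_stream(n, m):
    """Attention scores per block: read (m*n) + latent (m*m) + write (n*m)."""
    return 2 * n * m + m * m


def score_count_vanilla(n):
    """Attention scores for full self-attention over n data tokens."""
    return n * n


def random_tokens(count, dim, rng):
    """Hypothetical stand-in for embedded points: Gaussian feature vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
latent = random_tokens(4, 8, rng)  # m = 4 latent tokens, feature dimension 8
for n in (16, 64):                 # the same block handles both resolutions
    x_out, z_out = two_stream_block(random_tokens(n, 8, rng), latent)
    print(n, len(x_out), len(z_out), score_count_two_stream(n, 4), score_count_vanilla(n))
```

Because the block contains no parameter whose shape depends on `n`, the same block can be applied to a different number of points at test time than during training, which is the property the captions refer to as resolution invariance.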