Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions

Gene Chou, Yuval Bahat, Felix Heide

TL;DR

Diffusion-SDF reframes 3D shape generation as diffusion in the latent space of neural signed distance functions (SDFs), enabling both unconditional generation and conditional shape completion from partial point clouds, real scans, and 2D images. The method introduces a modulation module that compresses SDFs into latent vectors via a PointNet–VAE pair, followed by a diffusion model that denoises these latents; conditioning is achieved through encoders and cross-attention, allowing multi-modal guidance. End-to-end training with geometry-consistency losses couples the SDF, latent, and diffusion components to produce plausible, diverse surfaces, and experiments show strong performance on unconditional and conditional tasks across large-scale datasets. The work demonstrates scalable 3D generation in implicit representations and opens avenues for multi-modal, text-to-shape, and scene-level synthesis.
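As a rough illustration of "diffusion in latent space", the sketch below applies the standard DDPM forward (noising) process and an ancestral reverse step to a plain list of floats standing in for a latent vector z. The linear beta schedule, the noise-prediction parameterization, and the helper names (`q_sample`, `p_sample_step`, `denoising_loss`, `sample`) are assumptions for illustration only; Diffusion-SDF's actual schedule, denoiser network, and parameterization may differ. The denoiser is any hypothetical callable `denoiser(z_t, t)` returning a predicted noise vector.

```python
import math
import random

# Assumed linear beta schedule (a common DDPM default, not taken from the paper).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []  # cumulative products abar_t = prod_{s<=t} alpha_s
_prod = 1.0
for _a in alphas:
    _prod *= _a
    alpha_bars.append(_prod)


def q_sample(z0, t, noise):
    """Forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    s, n = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return [s * z + n * e for z, e in zip(z0, noise)]


def p_sample_step(zt, t, eps_pred, rng):
    """One ancestral DDPM reverse step, given the predicted noise for z_t."""
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = [(z - coef * e) / math.sqrt(alphas[t]) for z, e in zip(zt, eps_pred)]
    if t == 0:
        return mean  # no noise is added at the final step
    sigma = math.sqrt(betas[t])  # one standard choice of reverse variance
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def denoising_loss(denoiser, z0, rng):
    """Epsilon-prediction MSE for one latent vector at a random timestep."""
    t = rng.randrange(T)
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    pred = denoiser(q_sample(z0, t, noise), t)
    return sum((p - e) ** 2 for p, e in zip(pred, noise)) / len(z0)


def sample(denoiser, dim, rng):
    """Start from Gaussian noise and denoise step by step into a latent vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(T)):
        z = p_sample_step(z, t, denoiser(z, t), rng)
    return z
```

In a pipeline like the one described here, the latent returned by `sample` would then be decoded by the SDF network; conditioning (for example, cross-attention to partial-point-cloud or image features) would enter through the denoiser, which this sketch leaves abstract.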

Abstract

Probabilistic diffusion models have achieved state-of-the-art results for image synthesis, inpainting, and text-to-image tasks. However, they are still in the early stages of generating complex 3D shapes. This work proposes Diffusion-SDF, a generative model for shape completion, single-view reconstruction, and reconstruction of real-scanned point clouds. We use neural signed distance functions (SDFs) as our 3D representation to parameterize the geometry of various signals (e.g., point clouds, 2D images) through neural networks. Neural SDFs are implicit functions and diffusing them amounts to learning the reversal of their neural network weights, which we solve using a custom modulation module. Extensive experiments show that our method is capable of both realistic unconditional generation and conditional generation from partial inputs. This work expands the domain of diffusion models from learning 2D, explicit representations, to 3D, implicit representations.
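To make the SDF terminology concrete: an SDF maps a 3D point to its signed distance from a surface (negative inside, zero on the surface, positive outside), and the surface is the zero level set. The sketch below shows an exact analytic SDF for a sphere and a toy latent-conditioned network f(x, z) in which one shared set of weights plus a per-shape latent vector z defines a shape. The class `LatentSDFDecoder`, its concatenation-based conditioning (in the style of DeepSDF), and its random untrained weights are hypothetical; they illustrate the interface only and are not the paper's modulation module.

```python
import math
import random


def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Exact signed distance to a sphere: <0 inside, 0 on the surface, >0 outside."""
    return math.dist(p, center) - radius


class LatentSDFDecoder:
    """Toy f(x, z): one ReLU hidden layer applied to the concatenation [x, z].

    Hypothetical and untrained; shows how a latent vector selects a shape
    while the network weights stay shared across shapes.
    """

    def __init__(self, latent_dim, hidden=16, seed=0):
        rng = random.Random(seed)
        in_dim = 3 + latent_dim
        self.w1 = [[rng.gauss(0.0, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.5) for _ in range(hidden)]
        self.b2 = 0.0

    def __call__(self, x, z):
        inp = list(x) + list(z)  # query point followed by the shape latent
        h = [max(0.0, sum(w * v for w, v in zip(row, inp)) + b)
             for row, b in zip(self.w1, self.b1)]
        return sum(w * v for w, v in zip(self.w2, h)) + self.b2
```

Because only z changes from shape to shape, a generative model can target the low-dimensional latent instead of the full set of network weights, which is the difficulty the abstract attributes to diffusing neural SDFs directly.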

Paper Structure

This paper contains 25 sections, 8 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Our method generates clean meshes with diverse geometries. (Top) Unconditional generations from training on multiple classes. (Bottom) Conditional generation given various visual inputs, such as partial point clouds (the input point cloud is overlaid on the sample), real-scanned point clouds, and 2D images. Our method captures details of the conditioning geometry, such as the handle of the pitcher.
  • Figure 2: Our two-stage training pipeline. The first stage (top) trains SDFs jointly with a VAE to produce latent vectors $\textbf{z}$, each representing an SDF embedding. The second stage (bottom) uses these latent vectors as input to our diffusion model, which can be guided by various inputs. We connect the two models (gray arrow) for end-to-end training. At test time, the diffusion model takes as input a $\textbf{z}$ sampled from a Gaussian distribution, and we combine its output with the SDF network to form a complete SDF representation.
  • Figure 3: Samples from unconditional generation. Our method produces clean meshes with thin structures and diverse geometries. We also compute each sample's average Chamfer distance (CD) to the objects in the training set to confirm that our model produces unique shapes rather than copies of training objects.
  • Figure 4: Shape completion results from sparse, partial point clouds. Reconstructions from the proposed method capture details such as the legs of the chair, whether they are separated (top), branched out (middle), or connected (bottom).
  • Figure 5: Reconstructing scanned point clouds and single images. Our method captures details of conditioned geometry, such as the curves of the drill, engines of the plane, and pillows on the couch.
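The novelty check described for Figure 3 relies on the Chamfer distance between point sets sampled from shapes. The sketch below uses one common variant (mean squared nearest-neighbor distance, summed over both directions); CD definitions vary across papers (squared vs. unsquared, sum vs. mean), so this is an assumed variant, not necessarily the one used in the paper. The helpers `novelty_scores` and `summarize` are hypothetical: the caption reports an average, and a minimum over the training set is included as an extra indicator, since a near-zero minimum would suggest a sample closely matches some training shape.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance, both directions."""
    def one_way(src, dst):
        # For every point in src, find its closest point in dst (brute force, O(|src|*|dst|)).
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def novelty_scores(sample_points, training_set):
    """Chamfer distance from one generated shape to every training shape."""
    return [chamfer_distance(sample_points, shape) for shape in training_set]


def summarize(scores):
    """Average CD (as in the caption) and nearest-training-shape CD (extra indicator)."""
    return {"mean": sum(scores) / len(scores), "min": min(scores)}
```

For real meshes one would sample thousands of surface points per shape and use a spatial index (for example, a k-d tree) instead of the brute-force nearest-neighbor search shown here.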