Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image

Matthias Humt, Ulrich Hillenbrand, Rudolph Triebel

TL;DR

The paper tackles the ill-posed problem of completing high-fidelity 3D shapes from a single depth view. It compares denoising diffusion probabilistic models (DDPM) operating on continuous latents (via a VAE) with autoregressive transformers operating on discrete latents (via a VQ-VAE), evaluating both on generative shape modeling and on shape completion. Key contributions include state-of-the-art multi-modal completion from noisy depth images, a quantitative comparison against a discriminative baseline, and ablations on model size, conditioning, and inference settings. The findings show that diffusion in a continuous latent space performs best for shape completion under realistic conditions, while the autoregressive model can match or exceed diffusion when both are compared on the same discrete latent space.

Abstract

While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditional information such as text and images is frequently employed to steer the generation process, whereas other kinds, like partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models, Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers, which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison on both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, 19 tables.

Figures (4)

  • Figure 1: Predicting complete shapes from partial, noisy inputs (1) that closely resemble the ground truth (2) object remains challenging when the input is highly ambiguous. We explore models that fit generative priors to latent distributions, enabling multi-modal shape completion. The generative models produce multiple plausible predictions (3-5) covering the range of possibilities (in descending similarity to ground truth), with some completions surpassing the quality of the single prediction (6) from discriminative models.
  • Figure 2: Generative shape completion. (1) Given an input point cloud (1.2) sampled from the surface of an object (1.1), we apply a positional encoding (1.3) and aggregate the entire point cloud into a farthest-point-sampled (FPS) set (1.4) as in 3DShape2VecSet (Zhang et al., 2023), which we additionally pass through a feed-forward network to form a latent code. (2) We then model these latents either as a diagonal, multivariate Gaussian (2a) or quantize them into a fixed-size codebook (2b), forming our (VQ-)VAE encoder, and train a diffusion or autoregressive model on top, respectively. For shape completion, we condition the generative model on the encoding of a partial view (P) using a pre-trained feature extractor, which shares the overall architecture of the VAE. (3) We then predict occupancy probabilities through cross-attention between query points and latents sampled from the latent generative model, processed by $N$ Transformer encoder layers, forming the VAE decoder. (4) Optionally, a mesh can be extracted using the Marching Cubes algorithm. During inference, we discard the VAE encoder and sample latent codes either autoregressively or via denoising of samples drawn from a standard normal distribution.
  • Figure 3: Examples from the Automatica/YCB dataset. From left to right: input, ground truth, generative (best), and discriminative.
  • Figure 4: Real-world examples using depth data from a Kinect sensor. From left to right: input, ground truth, generative (best), and discriminative.
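
Three steps named in the Figure 2 caption can be sketched in a few lines: farthest point sampling (1.4), sampling from a diagonal Gaussian (2a), and nearest-neighbor codebook quantization (2b). The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: all function names, array shapes, and sizes are hypothetical, and the learned parts of the pipeline (positional encoding, feed-forward network, cross-attention, and the diffusion or autoregressive prior) are omitted.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, k: int, start: int = 0) -> np.ndarray:
    """Return indices of k points chosen greedily to be mutually far apart."""
    n = points.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be in [1, n]")
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        # Pick the point that is farthest from the current selection.
        chosen[i] = int(np.argmax(min_dist))
        dist_to_new = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


def sample_diagonal_gaussian(mean: np.ndarray, log_var: np.ndarray,
                             rng: np.random.Generator) -> np.ndarray:
    """Continuous path (2a): reparameterized sample z = mean + std * eps."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Discrete path (2b): map each latent to its nearest codebook entry."""
    # Pairwise squared distances, shape (num_latents, codebook_size).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.standard_normal((2048, 3))       # stand-in for a surface point cloud
    centers = cloud[farthest_point_sampling(cloud, k=32)]

    # Stand-in latents; in the real model these come from a learned encoder.
    mean = rng.standard_normal((32, 8))
    log_var = np.full((32, 8), -2.0)
    z_continuous = sample_diagonal_gaussian(mean, log_var, rng)

    codebook = rng.standard_normal((64, 8))      # hypothetical codebook size
    token_ids, z_discrete = quantize(mean, codebook)
    print(centers.shape, z_continuous.shape, token_ids.shape, z_discrete.shape)
```

The two latent paths differ in what the downstream prior sees: the continuous path yields real-valued vectors suitable for denoising, while the discrete path yields integer token indices that an autoregressive model can predict one at a time.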