
RealDiff: Real-world 3D Shape Completion using Self-Supervised Diffusion Models

Başak Melis Öcal, Maxim Tatarchenko, Sezer Karaoglu, Theo Gevers

TL;DR

This work proposes a self-supervised framework, namely RealDiff, that formulates point cloud completion as a conditional generation problem directly on real-world measurements to better deal with noisy observations without resorting to training on synthetic data.

Abstract

Point cloud completion aims to recover the complete 3D shape of an object from partial observations. While approaches relying on synthetic shape priors have achieved promising results in this domain, their applicability and generalizability to real-world data are still limited. To tackle this problem, we propose a self-supervised framework, namely RealDiff, that formulates point cloud completion as a conditional generation problem directly on real-world measurements. To better deal with noisy observations without resorting to training on synthetic data, we leverage additional geometric cues. Specifically, RealDiff simulates a diffusion process at the missing object parts while conditioning the generation on the partial input to address the multimodal nature of the task. We further regularize the training by matching the object silhouettes and depth maps predicted by our method with externally estimated ones. Experimental results show that our method consistently outperforms state-of-the-art methods in real-world point cloud completion.

Paper Structure (21 sections, 19 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Visual comparison of point cloud completion results. Compared to the SnowflakeNet [snowflakenet] baseline, our method can effectively restore the entire geometry while maintaining the integrity of the original structure.
  • Figure 2: Overview of our method. Given a pair of noisy point clouds representing an object, our pipeline takes one of these point clouds as input, and a pseudo ground-truth is created by combining the two point clouds. A diffusion process is simulated at the missing parts (unoccupied input voxels $\boldsymbol{\mathrm{\tilde{x}}}_{0}$) of the voxelized input $\boldsymbol{\mathrm{x}}_{0}$, while conditioning the generation on the known parts (occupied input voxels $\boldsymbol{\mathrm{c}}_{0}$). To eliminate noise from the reconstructions, the silhouettes $\boldsymbol{\mathrm{\hat{S}}}_{v_{j}}$ and depth maps $\boldsymbol{\mathrm{\hat{D}}}_{v_{j}}$ rendered from the predicted object shape are constrained to match the auxiliary silhouettes $\boldsymbol{\mathrm{S}}_{v_{j}}$ (e.g., from ScanNet) and depth maps $\boldsymbol{\mathrm{D}}_{v_{j}}$ (e.g., from a pre-trained Omnidata model). At generation time, only $f_{\theta}$ is used to reconstruct a complete 3D shape from the real-world point cloud $\boldsymbol{\mathrm{p}}_{v_{1}}$.
  • Figure 3: Visual comparison of point cloud completion results on the ScanNet dataset [dai2017scannet]. From left to right: partial shapes sampled from depth images, completions from baselines, our results, and ground-truth CAD model alignments from Scan2CAD annotations. For multi-modal methods, we picked a single output shape corresponding to a specific random seed. Our method produces reconstructions that are more complete and better preserve the structure observed in the input.
  • Figure 4: Shape completion on real-world scans from the Redwood 3DScans dataset [redwood3d]. Our model pre-trained on ScanNet [dai2017scannet] is able to generalize to real-world partial shapes from another dataset.
  • Figure 5: Multimodal completion results for the ScanNet dataset. Shapes are ordered from left to right, top to bottom. Our method is able to generate multiple valid outputs across different runs. As the input becomes more incomplete, the uncertainty in how to complete the shape geometry also increases, which allows for higher diversity (first, third and fourth shapes). When a more complete input is provided (second shape), we observe only slight changes in the recovered geometry between different runs.
  • ...and 2 more figures
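
The training setup described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`voxelize`, `make_training_pair`, `noise_missing_parts`), the grid resolution, the assumption that points are normalized to $[-0.5, 0.5]^3$, and the use of the standard DDPM closed-form forward step with a cumulative noise level `alpha_bar_t` are all assumptions made for illustration. The sketch builds a pseudo ground-truth as the union of two voxelized partial scans and adds noise only at voxels that are unoccupied in the input, leaving the occupied input voxels untouched as the condition.

```python
import numpy as np

def voxelize(points, resolution=32):
    """Binary occupancy grid from points assumed to lie in [-0.5, 0.5]^3."""
    idx = np.floor((points + 0.5) * resolution).astype(int)
    idx = np.clip(idx, 0, resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

def make_training_pair(p_v1, p_v2, resolution=32):
    """Voxelized input, pseudo ground-truth (union of both scans), known mask."""
    x0 = voxelize(p_v1, resolution)
    target = np.maximum(x0, voxelize(p_v2, resolution))
    known = x0 > 0  # occupied input voxels, used as the condition
    return x0, target, known

def noise_missing_parts(target, known, alpha_bar_t, rng):
    """Assumed DDPM-style forward step, applied only at unoccupied input voxels."""
    eps = rng.standard_normal(target.shape).astype(np.float32)
    noisy = np.sqrt(alpha_bar_t) * target + np.sqrt(1.0 - alpha_bar_t) * eps
    # Known voxels keep their clean value; only the missing region is diffused.
    return np.where(known, target, noisy), eps

# Hypothetical usage with random stand-ins for two partial scans of one object.
rng = np.random.default_rng(0)
p_v1 = rng.uniform(-0.5, 0.0, size=(200, 3))
p_v2 = rng.uniform(-0.5, 0.5, size=(200, 3))
x0, target, known = make_training_pair(p_v1, p_v2, resolution=16)
x_t, eps = noise_missing_parts(target, known, alpha_bar_t=0.5, rng=rng)
```

In such a setup, a denoising network playing the role of $f_{\theta}$ would receive `x_t` together with the known mask and be trained to recover the missing region; the silhouette and depth constraints from the caption are not covered by this sketch.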