BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion

Dequan Kong, Honghua Chen, Zhe Zhu, Mingqiang Wei

Abstract

Existing diffusion-based 3D shape completion methods typically use a conditional paradigm, injecting incomplete shape information into the denoising network via deep feature interactions (e.g., concatenation, cross-attention) to guide sampling toward complete shapes, often represented by voxel-based distance functions. However, these approaches fail to explicitly model the optimal global transport path, leading to suboptimal completions. Moreover, performing diffusion directly in voxel space imposes resolution constraints, limiting the generation of fine-grained geometric details. To address these challenges, we propose BridgeShape, a novel framework for 3D shape completion via a latent diffusion Schrödinger bridge. Its key innovations are twofold: (i) BridgeShape formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure a globally coherent transformation. (ii) We introduce a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) to encode 3D shapes into a compact latent space, leveraging self-projected multi-view depth information enriched with strong DINOv2 features to enhance geometric structural perception. By operating in a compact yet structurally informative latent space, BridgeShape effectively mitigates resolution constraints and enables more efficient, higher-fidelity 3D shape completion. BridgeShape achieves state-of-the-art performance on large-scale 3D shape completion benchmarks, demonstrating superior fidelity at higher resolutions and for unseen object classes.

Paper Structure

This paper contains 43 sections, 15 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Comparison between existing diffusion-based shape completion paradigms and our proposed latent-diffusion-bridge-based approach. (a) Existing diffusion models incorporate an additional branch to inject deep features into the denoising process, transmitting incomplete shape information without explicitly modeling the transformation between the incomplete shape $\mathrm{X}_T$ and the complete shape $\mathrm{X}_0$. (b) The proposed latent diffusion bridge explicitly models the optimal transport path between the latent distributions of incomplete and complete shapes ($\mathrm{Z}_T$ and $\mathrm{Z}_0$, respectively). (c) Existing diffusion frameworks often produce less coherent completions with missing details, whereas (d) our latent diffusion bridge generates more structurally consistent and detailed 3D shapes. (e) Qualitative comparison of our BridgeShape with DiffComplete [chu2023diffcomplete].
  • Figure 2: Overview of our training pipeline, which operates within the latent space and is based on the diffusion Schrödinger bridge (DSB). Stage I: Pre-training a Depth-Enhanced VQ-VAE on complete shapes to establish the latent space. Stage II: A co-trained encoder maps partial TSDF inputs into this latent space, where the diffusion bridge is applied to learn a structured diffusion trajectory between incomplete and complete shapes. This approach significantly enhances both efficiency and fidelity in shape completion.
  • Figure 3: Qualitative comparison of shape completion on 3D-EPN [dai2017shape].
  • Figure 4: Qualitative comparison on the synthetic ShapeNet [chang2015shapenet] dataset (pink) and the real-world ScanNet [dai2017scannet] dataset (green).
  • Figure 5: (a) Architecture of the diffusion model, consisting of an encoder, intermediate blocks, and a decoder. (b) Detailed structure of the ResBlock. (c) Detailed structure of the AttentionBlock.
  • ...and 8 more figures
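
To make the idea in Figure 1(b) and Stage II of Figure 2 concrete, the sketch below draws an intermediate latent on a stochastic path pinned at a complete-shape latent and an incomplete-shape latent. It is only an illustration under stated assumptions: it uses a plain Brownian bridge with constant noise scale `sigma` and time normalized to [0, 1] (so t = 1 plays the role of T), which may differ from the schedule BridgeShape actually uses. The function name `sample_bridge_latent` and all of its parameters are hypothetical and do not come from the paper.

```python
import numpy as np


def sample_bridge_latent(z0, zT, t, sigma=1.0, rng=None):
    """Draw z_t from a Brownian bridge pinned at z0 (t = 0) and zT (t = 1).

    Assumption: constant noise scale `sigma`; this is an illustrative
    stand-in, not the schedule described in the paper.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    # The mean interpolates linearly between the two endpoint latents.
    mean = (1.0 - t) * z0 + t * zT
    # The standard deviation vanishes at both endpoints and peaks at t = 0.5.
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(z0.shape)


# Toy latents standing in for encoder outputs of a complete and a partial shape.
z_complete = np.zeros((4, 8))
z_partial = np.ones((4, 8))
z_mid = sample_bridge_latent(z_complete, z_partial, t=0.5,
                             rng=np.random.default_rng(0))
```

Because the noise term is zero at t = 0 and t = 1, every sampled path starts exactly at the incomplete-shape latent and ends exactly at the complete-shape latent. This is the property that distinguishes a bridge from a conditional diffusion process that starts from pure Gaussian noise.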