NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation

Ruikai Cui, Weizhe Liu, Weixuan Sun, Senbo Wang, Taizhang Shang, Yang Li, Xibin Song, Han Yan, Zhennan Wu, Shenzhou Chen, Hongdong Li, Pan Ji

TL;DR

NeuSDFusion addresses the challenge of producing diverse, high-fidelity 3D shapes with spatially coherent structure under memory constraints. It introduces NeuSDF, a hybrid representation that encodes a shape's signed distance field (SDF) on three orthogonal planes, and a transformer-based spatial-aware autoencoder that compresses these planes into latent tri-planes. A latent diffusion model then generates latent tri-planes under multimodal conditioning (text, images, or point clouds); the decoder reconstructs the tri-planes, from which dense SDF values are queried and a mesh is extracted with Marching Cubes. Across unconditional generation, multi-modal completion, single-view reconstruction, and language-guided generation, the approach reports state-of-the-art results, with improved quality, diversity, and efficiency for 3D generation tasks.

Abstract

3D shape generation aims to produce innovative 3D content adhering to specific conditions and constraints. Existing methods often decompose 3D shapes into a sequence of localized components, treating each element in isolation without considering spatial consistency. As a result, these approaches exhibit limited versatility in 3D data representation and shape generation, hindering their ability to generate highly diverse 3D shapes that comply with the specified constraints. In this paper, we introduce a novel spatial-aware 3D shape generation framework that leverages 2D plane representations for enhanced 3D shape modeling. To ensure spatial coherence and reduce memory usage, we incorporate a hybrid shape representation technique that directly learns a continuous signed distance field representation of the 3D shape using orthogonal 2D planes. Additionally, we meticulously enforce spatial correspondences across distinct planes using a transformer-based autoencoder structure, promoting the preservation of spatial relationships in the generated 3D shapes. This yields an algorithm that consistently outperforms state-of-the-art 3D shape generation methods on various tasks, including unconditional shape generation, multi-modal shape completion, single-view reconstruction, and text-to-shape synthesis. Our project page is available at https://weizheliu.github.io/NeuSDFusion/.
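
To make the tri-plane idea concrete, the following is a minimal NumPy sketch of how a signed distance value might be queried from three orthogonal feature planes. It is an illustration under stated assumptions, not the authors' implementation: the function names (`bilinear_sample`, `query_triplane`, `decode_sdf`), the plane ordering (XY, XZ, YZ), the sum aggregation of the three plane features, and the tiny randomly initialized MLP decoder are all hypothetical choices made for this example.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly sample a (C, R, R) feature plane at (N, 2) coordinates in [-1, 1]."""
    _, R, _ = plane.shape
    # Map [-1, 1] to continuous pixel coordinates in [0, R - 1].
    xy = (np.clip(uv, -1.0, 1.0) + 1.0) * 0.5 * (R - 1)
    x0 = np.clip(np.floor(xy[:, 0]).astype(int), 0, R - 2)
    y0 = np.clip(np.floor(xy[:, 1]).astype(int), 0, R - 2)
    tx = xy[:, 0] - x0
    ty = xy[:, 1] - y0
    # The plane is indexed as [channel, row (second coord), column (first coord)].
    f00 = plane[:, y0, x0]
    f01 = plane[:, y0, x0 + 1]
    f10 = plane[:, y0 + 1, x0]
    f11 = plane[:, y0 + 1, x0 + 1]
    top = f00 * (1 - tx) + f01 * tx
    bottom = f10 * (1 - tx) + f11 * tx
    return (top * (1 - ty) + bottom * ty).T  # (N, C)

def query_triplane(planes, points):
    """planes: (3, C, R, R) for XY, XZ, YZ (assumed order); points: (N, 3) in [-1, 1]^3."""
    f_xy = bilinear_sample(planes[0], points[:, [0, 1]])
    f_xz = bilinear_sample(planes[1], points[:, [0, 2]])
    f_yz = bilinear_sample(planes[2], points[:, [1, 2]])
    # Assumption: features are aggregated by summation.
    return f_xy + f_xz + f_yz

def decode_sdf(features, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping (N, C) features to (N,) signed distances."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    return (hidden @ w2 + b2)[:, 0]

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 8, 16, 16))       # stand-in for learned planes
points = rng.uniform(-1, 1, size=(5, 3))       # query positions
w1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
sdf = decode_sdf(query_triplane(planes, points), w1, b1, w2, b2)
print(sdf.shape)  # (5,)
```

Evaluating such a query on a dense grid of points gives the SDF volume that a Marching Cubes routine would turn into a mesh. The memory saving comes from storing three R x R planes rather than an R x R x R volume.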

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: NeuSDFusion demonstrates exceptional performance in generating high-quality, diverse shapes with smooth surfaces and detailed structures, showcasing its capability in 3D shape synthesis across various tasks including unconditional generation, single-view reconstruction, shape completion, and text-to-3D synthesis.
  • Figure 2: Our method follows a pipeline consisting of three stages. Given a raw mesh, we first sample surface points and space-filling points to adapt each mesh to a NeuSDF representation. In the second stage, we compress the raw tri-plane representation into latent tri-planes $z$ with a spatial-aware autoencoder. In the third stage, we train a latent diffusion model capable of generating tri-plane latent $z_0$ from a standard Gaussian under flexible conditions. During the inference phase, we input the generated latent $z_0$ into the decoder, and generate a mesh using the Marching Cubes algorithm by querying the signed distance value of any position via interpolating the reconstructed tri-plane.
  • Figure 3: An illustration of the spatial-aware autoencoder design. Both (a) a roll-out mechanism and (b) a channel-wise concatenation strategy use a convolutional neural network to process a 2D feature map, which leads to contextual disorder. To address this issue, we propose (c) the all-as-tokens operation, which is designed to preserve spatial coherence. This operation is facilitated by (d) a transformer-based autoencoder structure and (e) a spatial-aware position embedding (SAPE) technique.
  • Figure 4: Multi-modal shape completion results. Our NeuSDFusion method generates shapes with superior quality and diversity compared to previous state-of-the-art approaches, while remaining consistent with the input partial shapes.
  • Figure 5: Single-view reconstruction on the Pix3D dataset. Note that our approach generates significantly more detailed shapes compared to previous works, demonstrating the effectiveness of our method in capturing intricate shape properties.
  • ...and 2 more figures
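
The all-as-tokens idea from Figure 3 can be sketched roughly as follows: each plane is cut into patches, the patches of all three planes form one token sequence, and self-attention lets every token see tokens from every plane. This is a hedged illustration only. The helper names, the patch size, the single-head attention without learned projections, and in particular the position embedding are assumptions; the embedding here shares one table per 3D axis across the planes that contain that axis, which is a hypothetical stand-in and may differ from the paper's SAPE.

```python
import numpy as np

def planes_to_tokens(planes, patch):
    """Split (3, C, R, R) planes into one sequence of 3 * (R // patch)^2 patch tokens."""
    P, C, R, _ = planes.shape
    g = R // patch
    x = planes.reshape(P, C, g, patch, g, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)            # (P, g, g, C, patch, patch)
    return x.reshape(P * g * g, C * patch * patch), g

def tokens_to_planes(tokens, channels, patch, g):
    """Inverse of planes_to_tokens."""
    x = tokens.reshape(3, g, g, channels, patch, patch)
    x = x.transpose(0, 3, 1, 4, 2, 5)            # (P, C, g, patch, g, patch)
    return x.reshape(3, channels, g * patch, g * patch)

def axis_shared_position_embedding(g, dim, rng):
    """Hypothetical embedding: one table per 3D axis, reused by planes sharing that axis."""
    table = rng.normal(size=(3, g, dim))         # tables for axes x, y, z
    plane_axes = [(0, 1), (0, 2), (1, 2)]        # assumed axes of the XY, XZ, YZ planes
    emb = [table[a][:, None, :] + table[b][None, :, :] for a, b in plane_axes]
    return np.stack(emb).reshape(3 * g * g, dim)

def self_attention(tokens):
    """Single-head attention without learned projections, for illustration only."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights

rng = np.random.default_rng(0)
planes = rng.normal(size=(3, 4, 8, 8))
tokens, g = planes_to_tokens(planes, patch=2)
tokens_pe = tokens + axis_shared_position_embedding(g, tokens.shape[1], rng)
mixed, weights = self_attention(tokens_pe)
print(tokens.shape, weights.shape)  # (48, 16) (48, 48)
```

The 48 x 48 attention matrix is the point of the sketch: a token from one plane can attend directly to tokens from the other two planes, whereas a convolution over rolled-out or channel-concatenated planes mixes features that are adjacent in the 2D layout but not necessarily related in 3D.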