SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation

Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, Minhyuk Sung

Abstract

We present a cascaded diffusion model based on a part-level implicit 3D representation. Our model achieves state-of-the-art generation quality and also enables part-level shape editing and manipulation without any additional training in a conditional setup. Diffusion models have demonstrated impressive capabilities in data generation as well as zero-shot completion and editing via a guided reverse process. Recent research on 3D diffusion models has focused on improving their generation capabilities with various data representations, while the absence of structural information has limited their capability in completion and editing tasks. We thus propose a novel diffusion model using a part-level implicit representation. To effectively learn diffusion with high-dimensional embedding vectors of parts, we propose a cascaded framework that learns diffusion first on a low-dimensional subspace encoding the extrinsic parameters of parts and then on the remaining high-dimensional subspace encoding their intrinsic attributes. In the experiments, we show that our method outperforms previous methods in both generation and part-level completion and manipulation tasks.
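
The two-phase cascade described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a standard DDPM-style ancestral sampler with a linear noise schedule (the abstract does not specify either), the noise predictors `extrinsic_eps` and `intrinsic_eps` are hypothetical stand-ins for trained networks, and the part count and dimensions are placeholder values.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption; the abstract does not state one)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def ddpm_sample(eps_model, shape, schedule, rng, cond=None):
    """Generic DDPM ancestral sampling loop over a set of part vectors."""
    betas, alphas, alpha_bars = schedule
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, t, cond)  # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # no noise at the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x

def extrinsic_eps(x, t, cond):
    """Hypothetical stand-in for the phase-1 network (unconditional)."""
    return 0.1 * x

def intrinsic_eps(x, t, cond):
    """Hypothetical stand-in for the phase-2 network, conditioned on extrinsics."""
    return 0.1 * x + 0.01 * cond.mean(axis=1, keepdims=True)

def sample_cascaded(num_parts=16, d_e=16, d_s=512, T=50, seed=0):
    """Phase 1 samples low-dimensional extrinsics; phase 2 samples
    high-dimensional intrinsics conditioned on the phase-1 output."""
    rng = np.random.default_rng(seed)
    schedule = make_schedule(T)
    e = ddpm_sample(extrinsic_eps, (num_parts, d_e), schedule, rng)
    s = ddpm_sample(intrinsic_eps, (num_parts, d_s), schedule, rng, cond=e)
    return e, s
```

Calling `sample_cascaded()` returns one array of extrinsic vectors and one of intrinsic vectors, with one row per part. The point of the split is that the first sampler only has to model the small extrinsic subspace, while the second works in the larger intrinsic subspace with the structure already fixed.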

Paper Structure (41 sections, 15 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Part-level implicit representation by Hertz et al. [Hertz:2022Spaghetti]. A latent vector $\mathbf{z}$ encoding global geometry is first mapped to a set of part latents $\{\mathbf{p}_i\}_{i=1}^N$, each of which is decomposed into extrinsic parameters $\{\mathbf{e}_i\}_{i=1}^N$ and intrinsic latents $\{\mathbf{s}_i\}_{i=1}^N$. The decoder, conditioned on $\{(\mathbf{e}_i, \mathbf{s}_i)\}_{i=1}^N$, outputs an occupancy value given a query point $\mathbf{x}$.
  • Figure 2: Pipeline overview. SALAD consists of two diffusion models for extrinsic and intrinsic vectors, respectively. During phase 1 (left), it generates extrinsic vectors representing structures of shapes. Phase 2 (right) takes these outputs as conditions and produces intrinsic vectors encoding local geometry information.
  • Figure 3: Architecture diagrams. The architecture for diffusion of $\mathbf{z}$ is a sequence of $M$ alternating MLPs and AdaIN [Perez:2018AdaLN] layers. The Time-Conditioned Transformer, a Transformer [Vaswani:2017Attention] architecture designed to handle diffusion on set data, replaces the MLPs with self-attention layers. SALAD is a cascade of two Time-Conditioned Transformers: one for diffusion of $\{\mathbf{e}_i\}_{i=1}^N$ and the other for $\{\mathbf{s}_i\}_{i=1}^N$. In the second phase of SALAD, a concatenation of $\{\mathbf{e}_i\}_{i=1}^N$ and $\gamma(t)$ is fed to the AdaIN layers as conditioning input.
  • Figure 4: Qualitative comparison of the shape generation. Given a query ground-truth shape, we retrieve, for each method, the closest generated shape as measured by Earth Mover's Distance (EMD). SALAD produces highly detailed 3D shapes compared to the baselines.
  • Figure 5: Qualitative comparison of the part completion. We examine SALAD and other baselines in part completion after ablating semantic parts or regions, highlighted in red in columns 2 and 3. SALAD produces realistic completions for missing parts. The baselines fail to preserve observed parts or introduce noticeable seams at bounding box boundaries.
  • ...and 8 more figures
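
The conditioning described in the Figure 3 caption, where a concatenation of the extrinsic vectors and $\gamma(t)$ is fed to adaptive normalization layers, can be sketched as below. This is an illustration under stated assumptions rather than the paper's architecture: $\gamma(t)$ is assumed to be a sinusoidal timestep encoding, the normalization is assumed to run over each part's feature channels, and the weight matrices `w_scale` and `w_shift` are hypothetical learned parameters.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal encoding standing in for gamma(t); dim is assumed even.
    The exact form used in the paper is not given in the caption."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def phase2_condition(e, t, t_dim=8):
    """Concatenate each part's extrinsic vector e_i with gamma(t).
    e: (N, d_e) -> returns (N, d_e + t_dim)."""
    gamma = timestep_embedding(t, t_dim)
    return np.concatenate([e, np.tile(gamma, (e.shape[0], 1))], axis=1)

def adaptive_norm(h, cond, w_scale, w_shift, eps=1e-5):
    """Normalize features, then apply a scale and shift predicted from cond.
    h: (N, d) features, cond: (N, c), w_scale and w_shift: (c, d)."""
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    h_norm = (h - mu) / (sigma + eps)  # per-part normalization (an assumption)
    scale = 1.0 + cond @ w_scale       # condition-dependent scale
    shift = cond @ w_shift             # condition-dependent shift
    return scale * h_norm + shift
```

With both weight matrices set to zero, `adaptive_norm` reduces to plain normalization, which makes it easy to check that the conditioning path only modulates the normalized features.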