Spice-E: Structural Priors in 3D Diffusion using Cross-Entity Attention

Etai Sella, Gal Fiebelman, Noam Atia, Hadar Averbuch-Elor

TL;DR

Spice-E adds cross-entity attention to pretrained transformer-based 3D diffusion models to inject structural priors learned from auxiliary guidance shapes. Replacing self-attention blocks with cross-entity blocks that mix the input and guidance latent streams lets the model learn task-specific priors while preserving its generative capabilities. The model is finetuned with the denoising objective $L = \mathbb{E}_{z_0,t,c_{text},z_c} \| \mathcal{M}_{\theta}(z_t,t,c_{text},z_c) - z_0 \|^2$, where $z_t$ is the noised version of the clean latent $z_0$ at step $t$, $c_{text}$ is the text condition, and $z_c$ is the guidance latent. During inference, a guidance shape and a text prompt jointly condition the output, yielding high-fidelity 3D shapes represented as a neural radiance field (NeRF) or a signed texture field (STF), with optional refinement from the 2D diffusion-based GaussianDreamer. Across semantic shape editing, abstraction-to-3D, and 3D stylization tasks, Spice-E achieves state-of-the-art or competitive results while offering substantially faster inference and general applicability without task-specific tailoring.
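To make the denoising objective concrete, here is a minimal NumPy sketch of a Monte Carlo estimate of $L$. It is an illustration under assumptions, not the authors' training code: the forward noising rule, the linear beta schedule in the usage lines, and the names `add_noise`, `denoising_loss`, and `model` are all hypothetical, and `model` stands in for $\mathcal{M}_{\theta}$ as any callable that predicts the clean latent.

```python
import numpy as np

def add_noise(z0, t, alpha_bar, rng):
    """Assumed forward process: z_t = sqrt(a_t) * z0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t][:, None]          # per-sample cumulative signal level
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def denoising_loss(model, z0, c_text, z_c, alpha_bar, rng):
    """Batch estimate of E || M(z_t, t, c_text, z_c) - z0 ||^2."""
    batch = z0.shape[0]
    # Sample one diffusion step per example.
    t = rng.integers(0, len(alpha_bar), size=batch)
    z_t = add_noise(z0, t, alpha_bar, rng)
    # The model predicts the clean latent z0 directly (not the noise).
    z0_hat = model(z_t, t, c_text, z_c)
    return float(np.mean(np.sum((z0_hat - z0) ** 2, axis=-1)))

# Usage with a placeholder model that ignores its conditions.
rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))  # assumed schedule
z0 = rng.standard_normal((4, 16))    # clean input latents
z_c = rng.standard_normal((4, 16))   # guidance latents
dummy_model = lambda z_t, t, c_text, z_c: z_t
loss = denoising_loss(dummy_model, z0, None, z_c, alpha_bar, rng)
```

Note that the target inside the norm is the clean latent $z_0$, so the network is trained to predict the denoised latent rather than the added noise.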

Abstract

We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present Spice-E, a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism for learning task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly expressive shapes. Extensive experiments demonstrate that Spice-E achieves state-of-the-art performance on these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach to any specific task.

Paper Structure

This paper contains 29 sections, 5 equations, 14 figures, and 10 tables.

Figures (14)

  • Figure 1: Finetuning 3D diffusion models with Spice-E. We finetune a transformer-based diffusion model [jun2023shap], pretrained on a large dataset of text-conditional 3D assets, to enable structural control over the generated 3D shapes. The diffusion model (in gray) is modified to use latent vectors from multiple entities at each step $\mathbf{t}$: a conditional guidance shape $\mathbf{X}_c$, encoded into the guidance latent $\mathbf{Z}_c$, and a noisy input latent $\mathbf{Z}_t$. The self-attention layers are replaced with our proposed cross-entity attention mechanism. At inference time, the finetuned diffusion model receives the guidance latent $\mathbf{Z}_c$, random Gaussian noise $\mathbf{Z}_T$, and a guidance text as input and, over $T$ steps, gradually denoises the input to produce an output latent $\mathbf{\hat{Z}}_0$. The output latent can be decoded into the output shape $\mathbf{X}_{out}$, represented as either a neural radiance field or a signed texture field.
  • Figure 2: Cross-Entity Attention. Given a pretrained self-attention block, we add a conditional latent $c$ originating from a different entity (i.e., a 3D shape). Our proposed mechanism mixes the query features (after a zero-convolution operator $\mathcal{Z}$ is applied to $c$), allowing structural priors from $c$ to be incorporated.
  • Figure 3: Semantic shape editing results are shown above (input guidance shape on the left and edited outputs on the right, shown in different colors for visualization purposes). As illustrated in the figure, our method can semantically edit input shapes according to target prompts, while preserving the shape's structure.
  • Figure 4: Semantic Shape Editing Comparison. Above, we compare against prior work on semantic shape editing. As ChangeIt3D [achlioptas2022changeit3d] operates over a point cloud representation, we show input point clouds on the left and edited point clouds on the right. For our results, we visualize the point clouds after shape encoding, hence our inputs are not identical to theirs. As illustrated in the figure, our method can perform more significant edits, yielding edited shapes that better reflect the target prompts.
  • Figure 5: Text-conditional Abstraction-to-3D Comparison. We compare to the results obtained using SketchShape [metzer2023latent] and Fantasia3D [chen2023fantasia3d]. Each method is given a cuboid-based abstract proxy shape and a target prompt (left). As illustrated in the figure, our results better preserve the structure of the abstract guidance shape while conveying the target text prompt. In the rightmost column (denoted as "Ours++"), we present results obtained after the optional refinement stage.
  • ...and 9 more figures
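
The cross-entity attention described in the Figure 2 caption can be sketched as a single-head NumPy example. This is a loose illustration under stated assumptions rather than the paper's implementation: the caption only says that query features are mixed after a zero-convolution is applied to $c$, so the additive mixing rule, the modelling of the zero-convolution as a zero-initialized per-token linear map `Wz`, and all function and parameter names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, Wq, Wk, Wv):
    """Plain single-head self-attention over tokens z of shape (n, d)."""
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_entity_attention(z, c, Wq, Wk, Wv, Wz):
    """Self-attention whose queries are mixed with a guidance latent c.

    Assumption: the zero-convolution is a 1x1 convolution, i.e. a
    per-token linear map Wz that starts at zero, and mixing is additive.
    """
    c_proj = c @ Wz                   # zero-conv applied to the guidance
    q = z @ Wq + c_proj @ Wq          # mix input and guidance queries
    k, v = z @ Wk, z @ Wv             # keys and values from the input only
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 4, 8
z = rng.standard_normal((n, d))       # noisy input latent tokens
c = rng.standard_normal((n, d))       # guidance latent tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# With Wz still at its zero initialization, the block reduces to the
# pretrained self-attention, so finetuning starts from unchanged behavior.
out_init = cross_entity_attention(z, c, Wq, Wk, Wv, np.zeros((d, d)))
out_tuned = cross_entity_attention(z, c, Wq, Wk, Wv, 0.5 * np.eye(d))
```

Under these assumptions, the zero initialization is what lets the modified block preserve the pretrained model's output at the start of finetuning, while a nonzero `Wz` lets the guidance shape steer the attention pattern.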