Locally Attentional SDF Diffusion for Controllable 3D Shape Generation

Xin-Yang Zheng; Hao Pan; Peng-Shuai Wang; Xin Tong; Yang Liu; Heung-Yeung Shum

TL;DR

LAS-Diffusion introduces a two-stage diffusion framework, operating on discrete SDF representations, that generates high-quality 3D shapes from 2D sketches. An occupancy-diffusion stage first produces a low-resolution occupancy field that approximates the shape shell, and an SDF-diffusion stage then synthesizes a high-resolution signed distance field within the occupied voxels to recover fine geometry. Sketch conditioning is injected through a view-aware local attention mechanism that uses 2D patch features from a pretrained ViT. The approach achieves strong local controllability, generalization to unseen shapes, and competitive category-conditioned results, while remaining computationally efficient. The work highlights promising directions for multimodal, sketch-driven 3D content creation, and the code and trained models are publicly available.

Abstract

Although the recent rapid evolution of 3D generative neural networks has greatly improved 3D shape generation, it is still not convenient for ordinary users to create 3D shapes and control the local geometry of generated shapes. To address these challenges, we propose a diffusion-based 3D generation framework, locally attentional SDF diffusion, to model plausible 3D shapes via 2D sketch image input. Our method is built on a two-stage diffusion model. The first stage, named occupancy-diffusion, aims to generate a low-resolution occupancy field to approximate the shape shell. The second stage, named SDF-diffusion, synthesizes a high-resolution signed distance field within the occupied voxels determined by the first stage to extract fine geometry. Our model is empowered by a novel view-aware local attention mechanism for image-conditioned shape generation, which takes advantage of 2D image patch features to guide 3D voxel feature learning, greatly improving local controllability and model generalizability. Through extensive experiments in sketch-conditioned and category-conditioned 3D shape generation tasks, we validate and demonstrate the ability of our method to provide plausible and diverse 3D shapes, as well as its superior controllability and generalizability over existing work. Our code and trained models are available at https://zhengxinyang.github.io/projects/LAS-Diffusion.html
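
The hand-off between the two stages can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the first stage yields a boolean $64^3$ occupancy grid and that each occupied voxel is split into its eight children on a $128^3$ grid (the resolutions quoted in the Figure 1 caption), which are then filled with Gaussian noise as the starting point for SDF-diffusion. The function name `subdivide_occupied` and the toy cube are hypothetical.

```python
import numpy as np

def subdivide_occupied(occ, rng):
    """Split every occupied coarse voxel into its 8 children on a 2x finer grid.

    occ: (R, R, R) boolean occupancy grid (hypothetical output of stage one).
    Returns the (8N, 3) integer coordinates of the fine voxels and one
    Gaussian noise value per fine voxel (hypothetical input to stage two).
    """
    coarse = np.argwhere(occ)  # (N, 3) indices of occupied coarse voxels
    # The 8 child offsets of a voxel when each axis is halved.
    offsets = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    fine = (coarse[:, None, :] * 2 + offsets[None, :, :]).reshape(-1, 3)
    noise = rng.standard_normal(len(fine))
    return fine, noise

rng = np.random.default_rng(0)
occ = np.zeros((64, 64, 64), dtype=bool)
occ[30:34, 30:34, 30:34] = True  # toy occupied block of 4 x 4 x 4 voxels
fine_coords, fine_noise = subdivide_occupied(occ, rng)
print(fine_coords.shape)  # one row per fine voxel on the 128^3 grid
```

Only the fine voxels inside occupied coarse voxels are stored, which is what keeps the second stage sparse.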

Paper Structure (58 sections, 6 equations, 21 figures, 4 tables)

Introduction
Related Work
Shape representations in 3D generation
GAN-based 3D generation
Autoregressive-based 3D generation
Diffusion-based 3D generation
Conditional 3D generation
Sketch-based shape reconstruction and generation
View-Aware Locally Attentional SDF Diffusion
Method Overview
Discrete signed distance function
Discrete surface-occupancy function
Two-stage diffusion
Sketch-conditioned generation
Self-conditioning Continuous Diffusion Model
...and 43 more sections

Figures (21)

Figure 1: Our LAS-Diffusion model includes two stages: occupancy-diffusion and SDF-diffusion. Occupancy-diffusion takes a noisy $64^3$ voxel grid as input, and uses a 3D U-Net to transform the volume to an occupancy volume. The occupied voxels are subdivided into a $128^3$ sparse voxel grid and filled with random noise. SDF-diffusion takes this noisy sparse voxel grid as input, and transforms noise signals to SDF values via a 3D sparse-voxel-based U-Net. For sketch-conditional inputs, the local image patch features obtained from a pretrained ViT backbone interact with U-Net voxel features via a view-aware local attention mechanism, to offer local controllability and better generalizability.
Figure 2: Illustration of our view-aware local attention mechanism. For voxel $V$, its voxel center is projected onto the image plane at $\mathbf{p}$ via a known perspective projection. We use the image patch features of the local patches around $\mathbf{p}$ (in yellow) to interact with the voxel feature at $V$ in the U-Net via cross-attention. For other voxels such as $U$, the operation is similar.
Figure 3: Left: Camera setup. Right: Shading images and sketches under the predefined views.
Figure 4: Sketch-conditioned shape generation on IKEA chairs.
Figure 5: The sketches of the results shown in Figure 4. From left to right: Sketch2Model, Sketch2Mesh, SketchSampler, LAS-Diffusion and GT.
...and 16 more figures
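
The view-aware local attention described in the Figure 2 caption can be sketched as follows. This is a hedged, simplified illustration rather than the paper's implementation: it assumes a pinhole camera with hypothetical intrinsics `K` and pose `R`, `t`, a 224-pixel image split into 16-pixel patches, a 3x3 patch window, and single-head attention without the learned query, key, and value projections a real layer would have. All function names are hypothetical.

```python
import numpy as np

def project(point, K, R, t):
    """Perspective-project a 3D point to pixel coordinates (assumed pinhole model)."""
    cam = R @ point + t          # world -> camera coordinates
    uvw = K @ cam                # camera -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]

def local_patch_indices(p, image_size=224, patch_size=16, radius=1):
    """(row, col) indices of the patches in a window around pixel p, clipped to the grid."""
    n = image_size // patch_size
    col = int(np.clip(p[0] // patch_size, 0, n - 1))
    row = int(np.clip(p[1] // patch_size, 0, n - 1))
    rows = range(max(row - radius, 0), min(row + radius, n - 1) + 1)
    cols = range(max(col - radius, 0), min(col + radius, n - 1) + 1)
    return [(i, j) for i in rows for j in cols]

def local_cross_attention(voxel_feat, patch_feats, idx):
    """Voxel feature (query) attends only to the selected local patch features."""
    keys = np.stack([patch_feats[i, j] for i, j in idx])        # (M, D)
    scores = keys @ voxel_feat / np.sqrt(voxel_feat.shape[0])   # scaled dot product
    weights = np.exp(scores - scores.max())                     # stable softmax
    weights /= weights.sum()
    return weights @ keys                                       # (D,)

# Toy setup: all values below are illustrative assumptions.
K = np.array([[200.0, 0.0, 112.0], [0.0, 200.0, 112.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])
rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((14, 14, 8))  # stand-in for ViT patch features
voxel_feat = rng.standard_normal(8)             # stand-in for a U-Net voxel feature

p = project(np.zeros(3), K, R, t)               # voxel center at the origin
idx = local_patch_indices(p)
out = local_cross_attention(voxel_feat, patch_feats, idx)
```

Restricting the keys to the patches around $\mathbf{p}$ is what ties each voxel to the part of the sketch it projects onto, in contrast to attending over all patches.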
