Sin3DM: Learning a Diffusion Model from a Single 3D Textured Shape

Rundi Wu, Ruoshi Liu, Carl Vondrick, Changxi Zheng

TL;DR

Sin3DM addresses single-instance 3D textured shape generation by learning a diffusion model in a compact triplane latent space, derived from a surface SDF $d(p)$ and texture $c(p)$. It trains a small-receptive-field denoiser with triplane-aware convolutions to capture patch-level variations while preserving global structure, and decodes samples into textured meshes via marching cubes and texture mapping. Compared to baselines, it achieves higher geometry and texture quality and enables practical capabilities such as retargeting, outpainting, and PBR material support, all with memory-efficient diffusion in latent space. This approach offers a practical path for high-quality 3D asset generation from a single exemplar, suitable for rapid content creation and editing in modern pipelines.

Abstract

Synthesizing novel 3D models that resemble the input example has long been pursued by graphics artists and machine learning researchers. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would induce large memory and computational cost. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our method outperforms prior methods in generation quality of 3D shapes.
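The triplane encoding described in the abstract can be illustrated with a minimal sketch of how a feature might be looked up for a single 3D point. This is not the authors' implementation: the function names `bilinear` and `query_triplane`, the nested-list layout, the normalization of coordinates to $[0, 1]$, and the choice to sum the three plane features are all assumptions made for illustration. In a full pipeline the resulting feature would be passed to decoders that predict the signed distance and texture values.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly interpolate a 2D grid of feature vectors at (u, v) in [0, 1]^2.

    plane[i][j] is a list of C floats; i indexes the first axis, j the second.
    """
    n_i, n_j = len(plane), len(plane[0])
    # Map normalized coordinates to continuous grid indices, clamped to the grid.
    fi = min(max(u, 0.0), 1.0) * (n_i - 1)
    fj = min(max(v, 0.0), 1.0) * (n_j - 1)
    i0, j0 = int(math.floor(fi)), int(math.floor(fj))
    i1, j1 = min(i0 + 1, n_i - 1), min(j0 + 1, n_j - 1)
    ti, tj = fi - i0, fj - j0
    channels = len(plane[0][0])
    return [
        (1 - ti) * (1 - tj) * plane[i0][j0][c]
        + ti * (1 - tj) * plane[i1][j0][c]
        + (1 - ti) * tj * plane[i0][j1][c]
        + ti * tj * plane[i1][j1][c]
        for c in range(channels)
    ]


def query_triplane(h_xy, h_xz, h_yz, p):
    """Return one feature vector for point p = (x, y, z), each in [0, 1].

    The point is projected onto the three axis-aligned planes and each
    plane is sampled bilinearly.
    """
    x, y, z = p
    f_xy = bilinear(h_xy, x, y)
    f_xz = bilinear(h_xz, x, z)
    f_yz = bilinear(h_yz, y, z)
    # Assumption: the three samples are aggregated by summation.
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
```

Storing three 2D feature maps instead of one dense 3D grid is what keeps the latent representation compact: memory grows with the square of the resolution rather than the cube.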

Paper Structure

This paper contains 18 sections, 8 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Trained on a single 3D textured shape (left), Sin3DM is able to produce diverse new samples, possibly of different sizes and aspect ratios. The generated shapes depict rich local variations with fine geometry and texture details, while retaining the global structure of the training example. Top: acropolis [akropolis]; bottom: industry house [indhouse].
  • Figure 2: Method overview. Given an input 3D textured shape, we first train a triplane auto-encoder to compress it into an implicit triplane latent representation $\mathbf{h}$. Then we train a latent diffusion model on it to learn the distribution of triplane features. See Fig. 3 for the structure of our denoising network $p_\theta$. At inference time, we sample a new triplane latent using the diffusion model and then decode it to a new 3D textured shape using the triplane decoder $\psi_\text{dec}$.
  • Figure 3: Left: denoising network structure. Our denoising network is a fully convolutional U-Net composed of four ResBlocks, and its bottleneck downsamples the input by $2$. Right: triplane-aware convolution block. A TriplaneConv block considers the relation between triplane feature maps. Inside ConvXY, we apply axis-wise average pooling to $\mathbf{h}_{xz}$ and $\mathbf{h}_{yz}$, yielding two feature vectors, which are then expanded to the original 2D dimension by replicating along the $y$ (or $x$) axis. The two expanded 2D feature maps are concatenated with $\mathbf{h}_{xy}$ and fed into a regular 2D convolution layer.
  • Figure 4: Retargeting results. By changing the spatial dimensions of the sampled Gaussian noise $\mathbf{h}_T$, we can resize the input to different sizes and aspect ratios. The training examples are labeled by blue boxes. From left to right: small town [smalltown], wooden fence [fence], train wagon [trainwagon] and antique pillar [pillarantique].
  • Figure 5: Visual comparison. We compare the generated results from our method and SSG [wu2022learning]. The inputs of these two examples are shown in Fig. \ref{fig:gallery}. Note that our mesh surfaces are much cleaner (see the zoomed-in columns), and our textures have much more detail (see the zoomed-in wood surfaces). Please see Fig. \ref{fig:compare_li} for the visual comparison to Sin3DGen [li2023patch].
  • ...and 12 more figures
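
The ConvXY step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `convxy_input`, the channel-first nested-list layout, and the channel ordering of the concatenation are assumptions, and plain Python lists are used only to keep the example dependency-free. The sketch builds the tensor that would be fed to the regular 2D convolution; the convolution itself is omitted.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def convxy_input(h_xy, h_xz, h_yz):
    """Build the input to the 2D convolution inside ConvXY.

    Assumed layout (channel-first nested lists):
      h_xy[c][x][y], h_xz[c][x][z], h_yz[c][y][z]
    Returns a list of 3*C channels, each of size X x Y.
    """
    size_x, size_y = len(h_xy[0]), len(h_xy[0][0])

    # Start with a copy of the h_xy channels.
    out = [[row[:] for row in channel] for channel in h_xy]

    for channel in h_xz:
        # Average over z: one value per x position.
        pooled = [mean(channel[x]) for x in range(size_x)]
        # Replicate along the y axis to recover an X x Y map.
        out.append([[pooled[x]] * size_y for x in range(size_x)])

    for channel in h_yz:
        # Average over z: one value per y position.
        pooled = [mean(channel[y]) for y in range(size_y)]
        # Replicate along the x axis to recover an X x Y map.
        out.append([pooled[:] for _ in range(size_x)])

    return out
```

The pooled and replicated maps give the convolution on the $xy$ plane a summary of what the other two planes contain at the same $x$ and $y$ coordinates, while the operation itself stays a 2D convolution. By symmetry, the corresponding steps for the $xz$ and $yz$ planes would pool the other two maps along their non-shared axis.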