DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose

Yusuke Yoshiyasu, Leyuan Sun

TL;DR

DiffSurf is a transformer-based denoising diffusion model that generates and reconstructs 3D surfaces in diverse poses by operating directly on vertex coordinates and normals and conditioning on body joints. The approach combines diffusion techniques, including score distillation sampling (SDS) and classifier-free guidance (CFG), with a UniDiffuser-inspired diffusion transformer and a mesh up-sampler that produces dense meshes. This enables unconditional generation, pose conditioning, shape variation, and editing tasks across humans, mammals, and objects. Empirical results on multiple 3D datasets show improved diversity and quality over prior generative models, and near real-time performance for single-image mesh recovery when using SDS-guided refinement. The work demonstrates the versatility of a single diffusion framework for 3D surface generation, editing, and 2D-to-3D fitting, with promising directions toward more expressive human meshes and larger-scale 3D data.

Abstract

This paper presents DiffSurf, a transformer-based denoising diffusion model for generating and reconstructing 3D surfaces. Specifically, we design a diffusion transformer architecture that predicts noise from noisy 3D surface vertices and normals. With this architecture, DiffSurf is able to generate 3D surfaces in various poses and shapes, such as human bodies, hands, animals and man-made objects. Further, DiffSurf is versatile in that it can address various 3D downstream tasks including morphing, body shape variation and 3D human mesh fitting to 2D keypoints. Experimental results on 3D human model benchmarks demonstrate that DiffSurf can generate shapes with greater diversity and higher quality than previous generative models. Furthermore, when applied to the task of single-image 3D human mesh recovery, DiffSurf achieves accuracy comparable to prior techniques at a near real-time rate.

Paper Structure

This paper contains 13 sections, 10 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: DiffSurf addresses the unconditional generation of 3D surfaces in diverse poses. It can generate 3D surfaces of various object types such as humans, mammals and man-made objects. Downstream tasks, including unconditional generation, morphing and fitting to 2D keypoints, can be addressed with pre-trained DiffSurf models.
  • Figure 2: Overview. DiffSurf consists of a diffusion transformer and an up-sampler. The diffusion transformer takes in the noisy 3D coordinates of surface vertices ${\bf x}_t \in \mathbb{R}^{N \times 3}$ and body joints ${\bf y}_t \in \mathbb{R}^{J \times 3}$. It processes these two modalities of data along with their corresponding timestep tokens $t_x$ and $t_y$. The transformer then outputs noise predictions for vertex and joint tokens, $\epsilon_\theta^x$ and $\epsilon_\theta^y$, respectively. For the 3D surface generation of man-made objects, we also input the noisy surface normals ${\bf n}_t \in \mathbb{R}^{N \times 3}$ corresponding to vertex tokens into the diffusion transformer. Once the 3D coordinates of surface vertices ${\bf v}$ (and normals ${\bf n}$) are generated, up-sampling is optionally performed to obtain the full dense surface output ${\bf V}$ (and ${\bf N}$).
  • Figure 3: Comparison without and with surface normals.
  • Figure 4: Control point deformation. Left: comparison of the loss terms. Right: deformation examples obtained using five control points (head, wrists and ankles, marked in red). The process starts with the rest pose and a human mesh is deformed to align with the control points. DiffSurf can deform a mesh into extremely different poses, such as a forward-bending posture.
  • Figure 5: Left: unconditional generation and morphing of hands, dogs and animals. Right: class-conditioned generation of man-made objects.
  • ...and 2 more figures
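
The data flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear beta schedule, the random linear embeddings `W_x`, `W_y`, `W_t`, the token ordering, and the toy sizes `N`, `J`, `D` are all assumptions made for illustration. It only shows how vertices ${\bf x}_t$ and joints ${\bf y}_t$ could be noised at independent timesteps $t_x$ and $t_y$ and packed into one token sequence for a transformer.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule (a common DDPM default, not confirmed by the source).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(clean, t, alpha_bar, rng):
    # Standard DDPM forward process:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(clean.shape)
    noisy = np.sqrt(alpha_bar[t]) * clean + np.sqrt(1.0 - alpha_bar[t]) * eps
    return noisy, eps

def build_tokens(x_t, y_t, t_x, t_y, W_x, W_y, W_t):
    # Hypothetical linear embeddings; the layout
    # [t_x token, t_y token, vertex tokens, joint tokens] is an assumption.
    vert_tok = x_t @ W_x                                      # (N, D)
    joint_tok = y_t @ W_y                                     # (J, D)
    time_tok = np.array([[t_x], [t_y]], dtype=float) @ W_t    # (2, D)
    return np.concatenate([time_tok, vert_tok, joint_tok], axis=0)

rng = np.random.default_rng(0)
N, J, D = 8, 3, 16                    # toy sizes, chosen only for illustration
x0 = rng.standard_normal((N, 3))      # stand-in for clean vertex coordinates
y0 = rng.standard_normal((J, 3))      # stand-in for clean joint coordinates
alpha_bar = make_alpha_bar()

t_x, t_y = 500, 20                    # the two modalities get separate timesteps
x_t, eps_x = add_noise(x0, t_x, alpha_bar, rng)
y_t, eps_y = add_noise(y0, t_y, alpha_bar, rng)

W_x = rng.standard_normal((3, D))
W_y = rng.standard_normal((3, D))
W_t = rng.standard_normal((1, D))
tokens = build_tokens(x_t, y_t, t_x, t_y, W_x, W_y, W_t)
print(tokens.shape)
```

In a trained model, a transformer would map `tokens` to the noise predictions $\epsilon_\theta^x$ and $\epsilon_\theta^y$, which would be regressed against `eps_x` and `eps_y`; that network is omitted here.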