DiffSurf: A Transformer-based Diffusion Model for Generating and Reconstructing 3D Surfaces in Pose
Yusuke Yoshiyasu, Leyuan Sun
TL;DR
DiffSurf is a transformer-based denoising diffusion model that generates and reconstructs 3D surfaces in diverse poses by operating directly on vertex coordinates and normals and conditioning on body joints. The approach combines standard diffusion techniques, including score distillation sampling (SDS) and classifier-free guidance (CFG), with a UniDiffuser-inspired diffusion transformer and a mesh up-sampler that produces dense meshes. It supports unconditional generation, pose conditioning, shape variation, and editing across humans, mammals, and objects. Results on multiple 3D datasets show greater diversity and higher quality than prior generative models. For single-image mesh recovery with SDS-guided refinement, accuracy is comparable to prior techniques at a near real-time rate. The work shows that a single diffusion framework can cover 3D surface generation, editing, and 2D-to-3D fitting, and it points toward more expressive human meshes and larger-scale 3D data as future directions.
Abstract
This paper presents DiffSurf, a transformer-based denoising diffusion model for generating and reconstructing 3D surfaces. Specifically, we design a diffusion transformer architecture that predicts noise from noisy 3D surface vertices and normals. With this architecture, DiffSurf can generate 3D surfaces in various poses and shapes, such as human bodies, hands, animals, and man-made objects. DiffSurf also addresses a range of downstream 3D tasks, including morphing, body shape variation, and 3D human mesh fitting to 2D keypoints. Experimental results on 3D human model benchmarks demonstrate that DiffSurf generates shapes with greater diversity and higher quality than previous generative models. When applied to single-image 3D human mesh recovery, DiffSurf achieves accuracy comparable to prior techniques at a near real-time rate.
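To make "predicts noise from noisy 3D surface vertices and normals" concrete, the following is a minimal sketch of the standard DDPM forward (noising) step and the noise-prediction loss, applied to a per-vertex array of positions and normals. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the 1000 steps, the vertex count, and all function names are hypothetical, and the transformer itself is omitted.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (an assumption; the paper's schedule may differ)."""
    return np.linspace(beta_start, beta_end, num_steps)

def add_noise(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) and return (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    xt = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return xt, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
num_vertices = 431  # hypothetical coarse-mesh size, chosen for illustration

# Toy surface: random positions and unit-length normals, stacked per vertex.
vertices = rng.uniform(-1.0, 1.0, size=(num_vertices, 3))
normals = rng.standard_normal((num_vertices, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
x0 = np.concatenate([vertices, normals], axis=1)  # shape (num_vertices, 6)

betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)

# Noise the surface at an intermediate timestep.
xt, eps = add_noise(x0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)

# A denoiser would map (xt, t, optional joint conditioning) to eps_pred.
# Here the true noise stands in for a perfect prediction.
loss_perfect = noise_prediction_loss(eps, eps)
```

In training, a network would replace the stand-in prediction and the loss would be minimized over random timesteps; at sampling time the learned noise estimate is used to step from pure noise back to a surface.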
