LDM: Large Tensorial SDF Model for Textured Mesh Generation

Rengan Xie, Wenting Zheng, Kai Huang, Yizheng Chen, Qi Wang, Qi Ye, Wei Chen, Yuchi Huo

TL;DR

This work tackles fast, high-quality 3D asset generation from text or a single image without per-object optimization. It introduces LDM, a feed-forward pipeline that uses conditional multi-view diffusion to generate four input views and a transformer-based reconstructor to predict a unified tensorial SDF field from them, followed by gradient-based mesh refinement. Geometry and appearance share the tensorial SDF representation, and color is decoupled into albedo and shading, which supports relighting and material editing. Training proceeds in two stages (volume rendering for global features, then FlexiCubes-based local refinement), yielding high-quality textured meshes in seconds and outperforming prior methods on color and geometry metrics.

Abstract

Previous efforts have managed to generate production-ready 3D assets from text or images. However, these methods primarily employ NeRF or 3D Gaussian representations, which are not adept at producing the smooth, high-quality geometry required by modern rendering pipelines. In this paper, we propose LDM, a novel feed-forward framework capable of generating high-fidelity, illumination-decoupled textured meshes from a single image or a text prompt. We first use a multi-view diffusion model to generate sparse multi-view inputs from the image or text prompt, and then train a transformer-based model to predict a tensorial SDF field from these sparse multi-view images. Finally, we employ a gradient-based mesh optimization layer to refine this model, enabling it to produce an SDF field from which high-quality textured meshes can be extracted. Extensive experiments demonstrate that our method can generate diverse, high-quality 3D mesh assets with corresponding decomposed RGB textures within seconds.
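The first training stage described in the TL;DR renders the SDF field with volume rendering, which requires turning signed distances into densities. The summary does not say which conversion LDM uses. As a minimal sketch, the snippet below assumes a Laplace-CDF conversion in the style of VolSDF, with a sharpness parameter `beta`, followed by standard alpha compositing along a ray. The function names, the toy ray, and the value of `beta` are illustrative assumptions, not taken from the paper.

```python
import math

def sdf_to_density(sdf, beta):
    """Assumed VolSDF-style conversion: density = (1/beta) * LaplaceCDF(-sdf).

    sdf is positive outside the surface and negative inside, so density
    approaches 0 far outside and 1/beta deep inside.
    """
    s = -sdf
    tail = 0.5 * math.exp(-abs(s) / beta)
    cdf = tail if s <= 0 else 1.0 - tail
    return cdf / beta

def render_weights(densities, step):
    """Alpha-compositing weights for equally spaced samples along one ray."""
    weights, transmittance = [], 1.0
    for sigma in densities:
        alpha = 1.0 - math.exp(-sigma * step)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

# Toy ray (hypothetical): it travels along t and hits a flat surface at t = 1.
step = 0.01
ts = [i * step for i in range(201)]
sdf_along_ray = [1.0 - t for t in ts]
weights = render_weights([sdf_to_density(d, beta=0.02) for d in sdf_along_ray], step)

# The sample with the largest weight should sit close to the surface.
peak_t = ts[max(range(len(weights)), key=weights.__getitem__)]
```

A smaller `beta` concentrates the weights more tightly around the zero level set, which is why the choice of `beta` over training matters for how sharply the surface is resolved.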

Paper Structure

This paper contains 26 sections, 7 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Given a text prompt or a single image, our framework can generate corresponding high-quality 3D assets within seconds, including illumination-decoupled texture maps, facilitating integration into various applications, such as relighting and material editing.
  • Figure 2: Overview of our framework. Given an image or text prompt as the condition, we first use a diffusion model to generate images from multiple viewpoints. These images are then encoded into image feature tokens by the DINOv2 image encoder. The tokens are fed into a transformer-based tensorial object reconstructor, which outputs a tensorial SDF representation. This representation can then be rendered with volume rendering or with the FlexiCubes render layer to produce images or extract meshes.
  • Figure 3: Comparing model training performance across different Beta schedules.
  • Figure 4: Qualitative comparison with baselines shows that our method produces high-quality 3D assets with smooth geometry and clear textures, which align well with the input image.
  • Figure 5: The effect of illumination-decoupled textures. We relight assets in new scenes using both illumination-decomposed and non-decomposed textures. The 3D assets without illumination decomposition display incorrect shadows in the new scenes.
  • ...and 8 more figures
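
The relighting failure described in the Figure 5 caption can be illustrated with a small numeric sketch. It assumes, for illustration only, that color is the per-channel product of albedo and shading. The summary says only that color is decoupled into albedo and shading, so this product form, the helper `shade`, and the sample values are assumptions rather than the paper's formulation.

```python
def shade(base_color, shading):
    """Per-channel product of a base color and a shading term (assumed model)."""
    return tuple(c * s for c, s in zip(base_color, shading))

# Hypothetical texel: a red surface that was in shadow when it was captured.
albedo = (0.8, 0.2, 0.2)
capture_shading = (0.3, 0.3, 0.3)
new_shading = (1.0, 1.0, 1.0)  # fully lit in the new scene

# A non-decomposed texture stores the captured shadow inside the color.
baked_texture = shade(albedo, capture_shading)

# Relighting with a decoupled albedo applies only the new shading.
relit_decoupled = shade(albedo, new_shading)

# Relighting the baked texture keeps the old shadow under the new light.
relit_baked = shade(baked_texture, new_shading)
```

Here `relit_decoupled` recovers the full albedo under the new light, while `relit_baked` stays darkened by the shadow from capture time, which corresponds to the incorrect shadows the caption describes.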