Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale

Zekun Hao, David W. Romero, Tsung-Yi Lin, Ming-Yu Liu

TL;DR

Meshtron tackles the challenge of generating high-fidelity, artist-like 3D meshes at scale directly from point clouds. It combines an hourglass Transformer backbone, truncated-sequence training with sliding-window inference, and a robust sampling strategy that enforces mesh-sequence ordering, scaling to 64K faces at 1024-level coordinate resolution. The model uses over 50% less training memory and achieves 2.5× faster throughput, while delivering better topology, detail, and generalization than prior artist-like mesh generators and iso-surface methods. This advances AI-assisted 3D asset creation for games, film, and virtual environments by enabling realistic, controllable remeshing at a scale earlier methods could not reach.

Abstract

Meshes are fundamental representations of 3D surfaces. However, creating high-quality meshes is a labor-intensive task that requires significant time and expertise in 3D modeling. While a delicate object often requires over $10^4$ faces to be accurately modeled, recent attempts at generating artist-like meshes are limited to $1.6$K faces and heavy discretization of vertex coordinates. Hence, scaling both the maximum face count and vertex coordinate resolution is crucial to producing high-quality meshes of realistic, complex 3D objects. We present Meshtron, a novel autoregressive mesh generation model able to generate meshes with up to 64K faces at 1024-level coordinate resolution, over an order of magnitude higher face count and $8\times$ higher coordinate resolution than current state-of-the-art methods. Meshtron's scalability is driven by four key components: (1) an hourglass neural architecture, (2) truncated sequence training, (3) sliding window inference, and (4) a robust sampling strategy that enforces the order of mesh sequences. This results in over $50\%$ less training memory, $2.5\times$ faster throughput, and better consistency than existing works. Meshtron generates meshes of detailed, complex 3D objects at unprecedented levels of resolution and fidelity, closely resembling those created by professional artists, and opening the door to more realistic generation of detailed 3D assets for animation, gaming, and virtual environments.
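To make "coordinate resolution" concrete, the following minimal sketch quantizes vertex coordinates at 128 and 1024 levels and compares the worst-case reconstruction error. It assumes coordinates normalized to $[0, 1]$, uniform bins, and reconstruction at bin centers; none of these details is stated in the abstract, and `quantize` and `dequantize` are hypothetical helper names, not part of any Meshtron code.

```python
import random

def quantize(x: float, levels: int) -> int:
    """Map a coordinate in [0, 1] to an integer bin in 0..levels-1.

    Uniform binning is an assumption made for illustration.
    """
    return min(int(x * levels), levels - 1)

def dequantize(bin_index: int, levels: int) -> float:
    """Reconstruct a coordinate as the center of its bin (assumed convention)."""
    return (bin_index + 0.5) / levels

random.seed(0)
coords = [random.random() for _ in range(10_000)]

# Worst-case absolute error at each resolution; bounded by half a bin width.
max_err = {}
for levels in (128, 1024):
    max_err[levels] = max(
        abs(dequantize(quantize(x, levels), levels) - x) for x in coords
    )
    print(f"{levels:5d} levels: max error {max_err[levels]:.6f} "
          f"(bound {0.5 / levels:.6f})")
```

Under these assumptions the error bound is half a bin width, $1/(2L)$ for $L$ levels, so moving from 128 to 1024 levels tightens the bound by the same $8\times$ factor quoted in the abstract.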

Paper Structure

This paper contains 21 sections, 2 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Meshtron (https://research.nvidia.com/labs/dir/meshtron/) efficiently generates artist-style triangle or quad meshes of up to 64K triangle faces from point clouds. It generates mesh faces sequentially from bottom to top, as illustrated by the color gradient. Options are available to control the mesh density and to produce quad-like topology.
  • Figure 2: Topology comparison of Meshtron and the iso-surfacing methods DMTet [shen2021dmtet] and FlexiCubes [shen2023flexicube]. While iso-surfacing methods can produce meshes with high face counts, they often suffer from overly dense tessellation, bumpy artifacts, oversmoothing, and insufficient geometric detail, making them noticeably different from artist-created meshes. In contrast, Meshtron produces meshes with high-quality topology, featuring high geometric detail and well-structured tessellation that closely aligns with the standards of artist-created meshes.
  • Figure 3: Distribution of (a) face count and (b) face size in a dataset of 1M artist-crafted meshes. The average face count is 32K, an order of magnitude higher than what current methods can generate. Moreover, meshes with higher face counts tend to have smaller faces. To accurately capture these details, the 128-level vertex quantization used in prior works must be increased, here to 1024 levels.
  • Figure 4: Meshtron uses an Hourglass Transformer backbone with two shortening stages of factor $3\times$, conditioned on the point cloud, face count, and quad-face ratio. Latent tokens are color-coded to show their relationship with the mesh sequence. Tokens at each shortened stage align with the vertices and faces of the mesh sequence, providing a good inductive bias for mesh modeling.
  • Figure 5: Not all mesh tokens are equal. (a) illustrates the ordering of tokens in mesh sequences. (b) shows the per-token log perplexity averaged over 1K mesh sequences (top) and the compute allocated per token by the Hourglass architecture (bottom). Groups of 9 tokens forming a triangle are marked with dashed vertical lines. Earlier tokens in a triangle show lower perplexity, as the first two vertices are often shared with previous triangles. The last vertex is less constrained and therefore introduces greater uncertainty. The Hourglass Transformer captures this periodicity and allocates more compute to high-perplexity token positions, making it more effective for mesh generation; see the animation at https://research.nvidia.com/labs/dir/meshtron.
  • ...and 10 more figures
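
The captions of Figures 4 and 5 imply a simple length bookkeeping: each triangle contributes 9 coordinate tokens (3 vertices with 3 coordinates each), and two shortening stages of factor 3 reduce the sequence first to one token per vertex and then to one token per face. The sketch below illustrates this under stated assumptions. The helper names `mesh_to_tokens` and `hourglass_lengths` are hypothetical, faces are flattened in the order given rather than in the paper's ordering convention (which is not described in this excerpt), and no Transformer is implemented.

```python
from typing import List, Tuple

Vertex = Tuple[int, int, int]  # quantized (x, y, z) bin indices
Face = Tuple[int, int, int]    # three indices into the vertex list

def mesh_to_tokens(vertices: List[Vertex], faces: List[Face]) -> List[int]:
    """Flatten a triangle mesh into 9 coordinate tokens per face.

    Faces and vertices are emitted in input order; the ordering rule
    used by Meshtron is not reproduced here.
    """
    tokens: List[int] = []
    for face in faces:
        for vertex_index in face:
            tokens.extend(vertices[vertex_index])
    return tokens

def hourglass_lengths(num_tokens: int, factor: int = 3, stages: int = 2) -> List[int]:
    """Sequence length at each level for the given shortening factor and stage count."""
    lengths = [num_tokens]
    for _ in range(stages):
        if lengths[-1] % factor != 0:
            raise ValueError("sequence length must be divisible by the factor")
        lengths.append(lengths[-1] // factor)
    return lengths

# Two triangles sharing an edge (vertices 1 and 2).
vertices = [(0, 0, 0), (10, 0, 0), (0, 10, 0), (10, 10, 0)]
faces = [(0, 1, 2), (1, 3, 2)]

tokens = mesh_to_tokens(vertices, faces)
lengths = hourglass_lengths(len(tokens))
print(tokens)   # 18 coordinate tokens
print(lengths)  # [18, 6, 2]: coordinate, vertex, and face levels
```

With this bookkeeping, the shortened levels have exactly one token per vertex occurrence and one per face, which is the alignment the Figure 4 caption describes. It also shows why face count drives cost: the full-resolution sequence grows as 9 tokens per face.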