FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

Hanxiao Wang, Yuan-Chen Guo, Ying-Tian Liu, Zi-Xin Zou, Biao Zhang, Weize Quan, Ding Liang, Yan-Pei Cao, Dong-Ming Yan

TL;DR

FACE is a novel Autoregressive Autoencoder (ARAE) framework that generates meshes at the face level, treating each triangle face, the fundamental building block of a mesh, as a single, unified token.

Abstract

Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
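
The factor-of-nine reduction and the 0.11 ratio quoted in the abstract follow from simple counting: a triangle has three vertices with three coordinates each, so a coordinate-level sequence spends nine tokens per face. The sketch below works through that arithmetic. It assumes the compression ratio is measured as face-level sequence length divided by the length of the naive nine-tokens-per-face flattening; the function names are illustrative and do not come from the paper.

```python
# Illustrative arithmetic only; names are hypothetical, not the authors' code.
COORDS_PER_VERTEX = 3       # x, y, z
VERTICES_PER_TRIANGLE = 3


def coordinate_level_length(num_faces: int) -> int:
    """Sequence length when every vertex coordinate is its own token."""
    return num_faces * VERTICES_PER_TRIANGLE * COORDS_PER_VERTEX


def face_level_length(num_faces: int) -> int:
    """Sequence length under a one-face-one-token scheme."""
    return num_faces


def compression_ratio(num_faces: int) -> float:
    """Assumed definition: face-level length over coordinate-level length."""
    return face_level_length(num_faces) / coordinate_level_length(num_faces)


if __name__ == "__main__":
    faces = 1000
    print(coordinate_level_length(faces))      # 9000
    print(face_level_length(faces))            # 1000
    print(round(compression_ratio(faces), 2))  # 0.11
```

Under this assumed definition the ratio is 1/9 regardless of mesh size, which rounds to the 0.11 figure in the abstract.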

Paper Structure (22 sections, 5 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: High-fidelity meshes reconstructed by FACE from point clouds. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) driven by a new mesh compression strategy. This paradigm represents the mesh using a dramatically shorter sequence, achieving state-of-the-art efficiency while producing high-quality 3D geometry.
  • Figure 2: The end-to-end pipeline of our FACE model. An encoder compresses the input point cloud into a latent VecSet. An autoregressive decoder then conditions on this VecSet, generating the mesh face-by-face. A Face Embedding layer encodes the 9 coordinate tokens of each face into a single face token, and a CausalMLP head decodes each latent face token back into 9 quantized coordinate tokens.
  • Figure 3: Overview of our image-to-mesh generation pipeline. We first use the input image to condition a DiT model. The resulting latent VecSet is then fed into the Autoregressive Face Decoder to produce the final mesh.
  • Figure 4: Qualitative comparison of mesh reconstruction quality on the Toys4K dataset.
  • Figure 5: Qualitative comparison of image-conditioned mesh generation.
  • ...and 1 more figure
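
The Figure 2 caption describes each face as a group of 9 quantized coordinate tokens that is embedded into, and later decoded from, a single face token. The sketch below illustrates only that data layout: grouping a triangle's coordinates into 9 integer tokens and recovering an approximate triangle from them. It is not the authors' implementation. The learned Face Embedding layer and CausalMLP head are not reproduced, and the bin count `NUM_BINS`, the [-1, 1] coordinate range, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[Vertex, Vertex, Vertex]

NUM_BINS = 128  # hypothetical quantization resolution, not taken from the paper


def quantize(x: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate assumed to lie in [-1, 1] to a bin in [0, num_bins - 1]."""
    x = min(max(x, -1.0), 1.0)  # clamp out-of-range inputs
    return min(int((x + 1.0) / 2.0 * num_bins), num_bins - 1)


def dequantize(q: int, num_bins: int = NUM_BINS) -> float:
    """Return the centre of bin q, mapped back to [-1, 1]."""
    return (q + 0.5) / num_bins * 2.0 - 1.0


def face_to_tokens(face: Face) -> List[int]:
    """Flatten one triangle into its 9 quantized coordinate tokens."""
    return [quantize(c) for vertex in face for c in vertex]


def tokens_to_face(tokens: Sequence[int]) -> Face:
    """Rebuild an approximate triangle from 9 coordinate tokens."""
    if len(tokens) != 9:
        raise ValueError("a triangle face needs exactly 9 coordinate tokens")
    coords = [dequantize(t) for t in tokens]
    v0, v1, v2 = (tuple(coords[i:i + 3]) for i in range(0, 9, 3))
    return (v0, v1, v2)


def mesh_to_face_groups(faces: Sequence[Face]) -> List[List[int]]:
    """One 9-token group per face; each group would correspond to one sequence step."""
    return [face_to_tokens(f) for f in faces]


if __name__ == "__main__":
    tri: Face = ((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0))
    groups = mesh_to_face_groups([tri, tri])
    print(len(groups), len(groups[0]))  # 2 faces, 9 tokens each
    print(tokens_to_face(groups[0]))
```

In this layout the autoregressive sequence has one step per face group rather than one step per coordinate, which is where the shorter sequence comes from.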