FACE: A Face-based Autoregressive Representation for High-Fidelity and Efficient Mesh Generation

Hanxiao Wang; Yuan-Chen Guo; Ying-Tian Liu; Zi-Xin Zou; Biao Zhang; Weize Quan; Ding Liang; Yan-Pei Cao; Dong-Ming Yan

TL;DR

FACE is a novel Autoregressive Autoencoder (ARAE) framework that generates meshes at the face level, treating each triangle face, the fundamental building block of a mesh, as a single, unified token.

Abstract

Autoregressive models for 3D mesh generation suffer from a fundamental limitation: they flatten meshes into long vertex-coordinate sequences. This results in prohibitive computational costs, hindering the efficient synthesis of high-fidelity geometry. We argue this bottleneck stems from operating at the wrong semantic level. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) framework that reconceptualizes the task by generating meshes at the face level. Our one-face-one-token strategy treats each triangle face, the fundamental building block of a mesh, as a single, unified token. This simple yet powerful design reduces the sequence length by a factor of nine, leading to an unprecedented compression ratio of 0.11, halving the previous state-of-the-art. This dramatic efficiency gain does not compromise quality; by pairing our face-level decoder with a powerful VecSet encoder, FACE achieves state-of-the-art reconstruction quality on standard benchmarks. The versatility of the learned latent space is further demonstrated by training a latent diffusion model that achieves high-fidelity, single-image-to-mesh generation. FACE provides a simple, scalable, and powerful paradigm that lowers the barrier to high-quality structured 3D content creation.
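The factor-of-nine reduction follows from simple counting: a triangle has 3 vertices with 3 coordinates each, so a flat coordinate sequence spends 9 tokens per face where the one-face-one-token strategy spends 1. The sketch below only illustrates that arithmetic. The function name `sequence_lengths`, and the choice to measure the compression ratio against a one-token-per-coordinate baseline, are assumptions made for illustration and are not taken from the paper.

```python
def sequence_lengths(num_faces: int) -> dict:
    """Compare token counts for a triangle mesh with `num_faces` faces.

    Assumption: the baseline is a flat sequence with one token per
    vertex coordinate, and the compression ratio is measured against it.
    """
    coords_per_face = 3 * 3  # 3 vertices x 3 coordinates (x, y, z)
    coordinate_tokens = num_faces * coords_per_face  # one token per coordinate
    face_tokens = num_faces                          # one token per face
    return {
        "coordinate_tokens": coordinate_tokens,
        "face_tokens": face_tokens,
        "compression_ratio": face_tokens / coordinate_tokens,
    }


stats = sequence_lengths(4000)
print(stats["coordinate_tokens"], stats["face_tokens"])
print(round(stats["compression_ratio"], 2))
```

Under this counting the ratio is 1/9 for any number of faces, which rounds to the 0.11 quoted in the abstract.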

Paper Structure (22 sections, 5 equations, 6 figures, 5 tables)

Introduction
Related Work
Methodology
Face-based Autoregressive Representation
Model Architecture
Shape Encoder
Autoregressive Face Decoder
Training Objective and Efficiency Analysis
End-to-End Training Objective
Efficiency Analysis
Image-to-Mesh via Latent Diffusion
Experiment
Implementation Details
Autoregressive Autoencoder.
Image-Conditioned Diffusion Transformer.
...and 7 more sections

Figures (6)

Figure 1: High-fidelity meshes reconstructed by FACE from point clouds. We introduce FACE, a novel Autoregressive Autoencoder (ARAE) driven by a new mesh compression strategy. This paradigm represents the mesh using a dramatically shorter sequence, achieving state-of-the-art efficiency while producing high-quality 3D geometry.
Figure 2: The end-to-end pipeline of our FACE model. An encoder compresses the input point cloud into a latent VecSet. An autoregressive decoder then conditions on this VecSet, generating the mesh face by face. A Face Embedding layer encodes the 9 coordinate tokens of each face into a single face token, and a CausalMLP head decodes each latent face token back into 9 quantized coordinate tokens.
Figure 3: Overview of our image-to-mesh generation pipeline. We first use the input image to condition a DiT model. The resulting latent VecSet is then fed into the Autoregressive Face Decoder to produce the final mesh.
Figure 4: Qualitative comparison of mesh reconstruction quality on the Toys4K dataset.
Figure 5: Qualitative comparison of image-conditioned mesh generation.
...and 1 more figures
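The Figure 2 caption describes each face as 9 quantized coordinate tokens. A minimal sketch of that discrete side of the representation might look as follows. It covers only quantizing a triangle into 9 integer tokens and recovering approximate coordinates from them, not the learned Face Embedding layer or the CausalMLP head. The bin count `NUM_BINS = 128`, the `[-1, 1]` coordinate range, and all function names are assumptions made for illustration, not values from the paper.

```python
from typing import List, Sequence, Tuple

NUM_BINS = 128  # assumed quantization resolution, not taken from the paper

Vertex = Tuple[float, float, float]


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a coordinate in [-1, 1] to an integer bin in [0, num_bins - 1]."""
    clamped = max(-1.0, min(1.0, value))
    return min(int((clamped + 1.0) / 2.0 * num_bins), num_bins - 1)


def dequantize(token: int, num_bins: int = NUM_BINS) -> float:
    """Return the center of the bin that `token` refers to."""
    return (token + 0.5) / num_bins * 2.0 - 1.0


def face_to_tokens(face: Sequence[Vertex]) -> List[int]:
    """Flatten one triangle (3 vertices) into 9 quantized coordinate tokens."""
    assert len(face) == 3, "a triangle face has exactly 3 vertices"
    return [quantize(c) for vertex in face for c in vertex]


def tokens_to_face(tokens: Sequence[int]) -> List[Vertex]:
    """Rebuild approximate vertex positions from 9 coordinate tokens."""
    assert len(tokens) == 9, "a face is described by exactly 9 tokens"
    coords = [dequantize(t) for t in tokens]
    return [(coords[i], coords[i + 1], coords[i + 2]) for i in range(0, 9, 3)]


triangle = [(-1.0, 0.0, 1.0), (0.5, -0.5, 0.25), (0.0, 0.0, 0.0)]
tokens = face_to_tokens(triangle)
recovered = tokens_to_face(tokens)
print(tokens)
print(recovered)
```

In the pipeline described by the caption, a group of 9 tokens like this would be embedded into a single face token on the input side and emitted by the CausalMLP head on the output side. With bin-center dequantization, the round-trip error per coordinate is bounded by half a bin width, which is `1 / NUM_BINS` on the assumed `[-1, 1]` range.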
