MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

Jinnan Chen, Lingting Zhu, Zeyu Hu, Shengju Qian, Yugang Chen, Xin Wang, Gim Hee Lee

TL;DR

MAR-3D introduces a progressive 3D mesh generation framework that unifies a Pyramid VAE with cascaded Masked Auto-Regressive Models to up-sample latent tokens from low to high resolution. By employing random masking during training and a diffusion-driven objective, the approach effectively handles the unordered nature of 3D latent tokens and mitigates quantization losses seen in prior methods. The system leverages conditional cues from CLIP and DINOv2 for semantic guidance and uses condition augmentation to reduce error propagation across resolution scales, enabling efficient and scalable high-resolution mesh generation. Empirical results on public benchmarks show state-of-the-art geometric fidelity and strong generalization to unseen data, with clear improvements over diffusion-only and joint-distribution models. Overall, MAR-3D demonstrates that decomposing joint distributions into temporal (diffusion) and spatial (auto-regressive) components with progressive up-sampling yields scalable, high-quality 3D generation suitable for open-world applications.

Abstract

Recent advances in auto-regressive transformers have revolutionized generative modeling across domains ranging from language processing to visual generation. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization incurs substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that efficiently up-scales the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization compared to existing methods but also scales better than joint distribution modeling approaches (e.g., diffusion transformers).
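To make the random masking and random-order generation described in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the equal-size split of tokens across steps, and the `predict_fn` callback (a stand-in for the transformer plus its denoising head) are all hypothetical.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * num_tokens) True entries at random positions."""
    num_masked = int(round(mask_ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask

def random_order_schedule(num_tokens, num_steps, rng):
    """Split a random permutation of token indices into num_steps groups.

    Each group holds the tokens generated at one auto-regressive step.
    Equal-size groups are an assumption made here for simplicity.
    """
    order = rng.permutation(num_tokens)
    return np.array_split(order, num_steps)

def generate(num_tokens, dim, num_steps, predict_fn, rng):
    """Fill in latent tokens group by group, in random order.

    predict_fn(tokens, known, indices) is a hypothetical placeholder that
    returns an array of shape (len(indices), dim) for the requested tokens,
    given the tokens generated so far and a mask of which ones are known.
    """
    tokens = np.zeros((num_tokens, dim))
    known = np.zeros(num_tokens, dtype=bool)
    for indices in random_order_schedule(num_tokens, num_steps, rng):
        tokens[indices] = predict_fn(tokens, known, indices)
        known[indices] = True
    return tokens, known

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Training side: hide a random 75% of 16 tokens.
    print(random_mask(16, 0.75, rng).sum())
    # Inference side: a dummy predictor that samples Gaussian noise.
    dummy = lambda tokens, known, idx: rng.normal(size=(len(idx), tokens.shape[1]))
    tokens, known = generate(16, 8, 4, dummy, rng)
    print(tokens.shape, known.all())
```

Because the mask positions and the generation order are both drawn from random permutations, nothing in this loop depends on a fixed token ordering, which is the property the abstract relies on for unordered 3D latents.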

Paper Structure

This paper contains 37 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of MAR-3D. (a) Pyramid VAE: learnable tokens are processed through separate cross-attention layers that take multi-resolution point clouds and normals as input to generate occupancy fields. (b) Cascaded MAR: conditioned on image features extracted by CLIP and DINOv2, we employ a cascaded design with a MAR-LR model for generating low-resolution tokens and a MAR-HR model for generating high-resolution tokens. The MAR architecture details are illustrated in the blue box. While MAR-LR and MAR-HR share the same architecture, they differ in their inputs: MAR-HR additionally requires low-resolution tokens as input (shown in the dashed box).
  • Figure 2: Comparison of rendered normal maps. We visualize the normal maps rendered from the results of our method and the baseline methods.
  • Figure 3: VAE metrics with varying numbers of tokens. We show the CD and IoU of meshes reconstructed by our Pyramid VAE and by the original VAE for different numbers of tokens.
  • Figure 4: Ablation study on token resolution and model scaling strategies. Results (a)-(h) demonstrate different model configurations and settings, with detailed analysis provided in the main text.
  • Figure 5: Visual comparison of VAE reconstruction. (a)-(c) show reconstruction results from a single-level VAE with 256, 1024, and 2048 latent tokens, respectively. (d) shows the result from our Pyramid VAE using 1024 tokens.
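
For readers unfamiliar with the two reconstruction metrics named in Figure 3, the sketch below shows one common way to compute Chamfer distance (CD) and volumetric IoU with NumPy. The exact conventions used in the paper are not stated in this section, so the choices here (mean squared nearest-neighbor distance summed over both directions, IoU over boolean occupancy samples) and the function names are assumptions for illustration only.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed convention: mean squared nearest-neighbor distance, summed over
    both directions. Other variants (unsquared, averaged) are also in use.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise squared distances, shape (N, M). Fine for small point sets;
    # large sets would need a spatial index instead of a dense matrix.
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return sq.min(axis=1).mean() + sq.min(axis=0).mean()

def volume_iou(occ_pred, occ_gt):
    """Intersection over union of two boolean occupancy arrays of equal shape."""
    occ_pred = np.asarray(occ_pred, dtype=bool)
    occ_gt = np.asarray(occ_gt, dtype=bool)
    union = np.logical_or(occ_pred, occ_gt).sum()
    if union == 0:
        return 1.0  # both empty: treat as a perfect match
    return np.logical_and(occ_pred, occ_gt).sum() / union

if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    shifted = pts + np.array([0.0, 0.0, 1.0])
    print(chamfer_distance(pts, pts))
    print(chamfer_distance(pts, shifted))
    print(volume_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

Lower CD and higher IoU both indicate a reconstruction closer to the ground-truth shape, which is how the curves in Figure 3 should be read.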