MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation
Jinnan Chen, Lingting Zhu, Zeyu Hu, Shengju Qian, Yugang Chen, Xin Wang, Gim Hee Lee
TL;DR
MAR-3D introduces a progressive 3D mesh generation framework that unifies a Pyramid VAE with cascaded Masked Auto-Regressive Models to up-sample latent tokens from low to high resolution. By employing random masking during training and a diffusion-driven objective, the approach effectively handles the unordered nature of 3D latent tokens and mitigates quantization losses seen in prior methods. The system leverages conditional cues from CLIP and DINOv2 for semantic guidance and uses condition augmentation to reduce error propagation across resolution scales, enabling efficient and scalable high-resolution mesh generation. Empirical results on public benchmarks show state-of-the-art geometric fidelity and strong generalization to unseen data, with clear improvements over diffusion-only and joint-distribution models. Overall, MAR-3D demonstrates that decomposing joint distributions into temporal (diffusion) and spatial (auto-regressive) components with progressive up-sampling yields scalable, high-quality 3D generation suitable for open-world applications.
Abstract
Recent advances in auto-regressive transformers have revolutionized generative modeling across domains, from language processing to visual generation. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient upscaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization compared to existing methods but also scales better than joint distribution modeling approaches (e.g., diffusion transformers).
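The random masking and random-order decoding described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mask ratio, the cosine unmasking schedule, and the `denoise_token` callback (a stand-in for a per-token diffusion sampler) are all assumptions introduced here for illustration.

```python
import math
import random


def random_training_mask(num_tokens, mask_ratio, rng):
    """Pick a random subset of token indices to mask for one training step.

    The mask ratio is a free parameter here (an assumption, not from the paper).
    """
    num_masked = max(1, round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))


def random_order_schedule(num_tokens, num_steps, rng):
    """Split a random permutation of token indices into per-step groups.

    A cosine schedule (assumed) sets how many tokens stay masked after each step.
    """
    order = list(range(num_tokens))
    rng.shuffle(order)
    groups, done = [], 0
    for step in range(1, num_steps + 1):
        # Number of tokens that should still be masked after this step.
        remaining = math.floor(num_tokens * math.cos(math.pi / 2 * step / num_steps))
        if step == num_steps:
            remaining = 0  # the final step must reveal everything
        target = num_tokens - remaining
        # Reveal at least one token per step while any remain masked.
        target = min(max(target, done + 1), num_tokens)
        groups.append(order[done:target])
        done = target
    return groups


def generate(num_tokens, num_steps, denoise_token, seed=0):
    """Fill tokens group by group in random order.

    `denoise_token(idx, context)` is a hypothetical callback standing in for
    a sampler that produces token `idx` given the tokens generated so far.
    """
    rng = random.Random(seed)
    tokens = [None] * num_tokens
    for group in random_order_schedule(num_tokens, num_steps, rng):
        context = list(tokens)  # tokens in one group share the same context
        for idx in group:
            tokens[idx] = denoise_token(idx, context)
    return tokens
```

Because the generation order is a fresh random permutation on every call, no fixed ordering of the 3D latent tokens is ever assumed, which is the property the abstract highlights.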
