AToM: Amortized Text-to-Mesh using 2D Diffusion

Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey Tulyakov

TL;DR

AToM tackles the inefficiency and limited generalization of per-prompt text-to-mesh methods by introducing an amortized, text-conditioned mesh generator trained across many prompts. It replaces HyperNetwork-style encodings with a 3D-aware text-to-triplane module and employs a two-stage training regime (NeuS-based volumetric warmup followed by high-resolution mesh refinement) to stabilize learning and scale to large prompt sets. Inference is fast (under 1 second) and requires no 3D supervision, with experiments showing clear performance gains over ATT3D and per-prompt baselines on large benchmarks like DF415 and Pig64. Limitations include dependence on the diffusion prior and topology constraints of the mesh representation, suggesting future work on higher-frequency priors and more expressive meshing schemes for further improvements.

Abstract

We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh framework optimized across multiple text prompts simultaneously. In contrast to existing text-to-3D methods that often entail time-consuming per-prompt optimization and commonly output representations other than polygonal meshes, AToM directly generates high-quality textured meshes in less than 1 second with around 10 times reduction in the training cost, and generalizes to unseen prompts. Our key idea is a novel triplane-based text-to-mesh architecture with a two-stage amortized optimization strategy that ensures stable training and enables scalability. Through extensive experiments on various prompt benchmarks, AToM significantly outperforms state-of-the-art amortized approaches with over 4 times higher accuracy (on the DF415 dataset) and produces more distinguishable and higher-quality 3D outputs. AToM demonstrates strong generalizability, offering fine-grained 3D assets for unseen interpolated prompts without further optimization during inference, unlike per-prompt solutions.
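
The two-stage amortized optimization can be pictured as a simple step-based schedule. The sketch below is only an illustration built from the description on this page (a volumetric first stage at low resolution that trains the SDF and texture modules, then a mesh-rasterization stage at high resolution that optimizes the whole network). The names `StageConfig`, `stage_for_step`, and `trained_heads`, the head names, and the resolution values are hypothetical placeholders, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    renderer: str           # how training images are rendered for diffusion guidance
    render_resolution: int  # placeholder value, NOT a number from the paper
    trained_heads: tuple    # output heads of the 3D network that are optimized


def stage_for_step(step, stage1_steps, low_res=64, high_res=512):
    """Return the (hypothetical) training configuration for a given step.

    Stage 1: volumetric rendering, low resolution, SDF and texture heads only.
    Stage 2: mesh rasterization, high resolution, all heads (whole network).
    `low_res` and `high_res` are illustrative defaults chosen for this sketch.
    """
    if step < stage1_steps:
        return StageConfig(
            name="stage1_volumetric",
            renderer="volumetric",
            render_resolution=low_res,
            trained_heads=("sdf", "texture"),
        )
    return StageConfig(
        name="stage2_mesh",
        renderer="mesh_rasterization",
        render_resolution=high_res,
        trained_heads=("sdf", "deformation", "texture"),
    )


if __name__ == "__main__":
    for step in (0, 999, 1000):
        cfg = stage_for_step(step, stage1_steps=1000)
        print(step, cfg.name, cfg.renderer, cfg.trained_heads)
```

In both stages the same generator would be updated on batches of many prompts at once, which is what makes the optimization amortized rather than per-prompt.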

Paper Structure

This paper contains 20 sections, 1 equation, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 2: The per-prompt text-to-mesh method TextMesh generates high-quality results but demands expensive optimization. Naively extending ATT3D for mesh generation leads to divergent training and poor geometry. AToM introduces a triplane-based architecture with two-stage amortized optimization for enhanced stability. AToM efficiently generates textured meshes for various text prompts in under one second during inference.
  • Figure 3: Inference and training of AToM. AToM inference (top): AToM generates textured meshes from given prompts in less than a second. The text-to-mesh generator proposed in AToM consists of three components: a) a text encoder that tokenizes the input prompt, b) a text-to-triplane network that outputs a triplane representation from the text embedding, and c) a 3D network that generates SDF, vertex deformation, and color to form a differentiable mesh from positions and triplane features. AToM training (bottom): AToM utilizes a two-stage amortized optimization, where the first stage leverages stable volumetric optimization to train only the SDF and texture modules using low-resolution renders. The second stage uses mesh rasterization to optimize the whole network through high-resolution renders. In both stages, AToM is trained simultaneously on many prompts through the guidance of a text-to-image diffusion prior without any 3D data supervision.
  • Figure 4: Comparing AToM to AToM Per-Prompt on the Pig64 compositional prompt set ("a pig {activity} {theme}"), where each row and column represent a different activity and theme, respectively. The models are trained using 56 prompts and tested on all 64 prompts; the 8 unseen prompts are shown on the diagonal. As depicted in (a), AToM consistently generates pigs with a similar identity and a uniform orientation, indicating that AToM also promotes feature sharing across prompts, similar to ATT3D. AToM also generates 3D content with consistent quality, while per-prompt optimization does not, as shown in (b). Additionally, per-prompt optimization is more prone to overlooking certain details, such as the top hat in row 2 column 4 and the shovel in row 4 column 2 in (b), while AToM preserves them. More importantly, AToM performs well on unseen prompts without further optimization, unlike the per-prompt solution.
  • Figure 5: Gallery of AToM evaluated on DF415. Here ^ and $ denote "a zoomed out DSLR photo of" and "a DSLR photo of", respectively.
  • Figure 6: Comparing AToM to ATT3D-IF$^{\dagger}$ evaluated on DF415. In each row, we mostly show results from two similar prompts. While ATT3D produces indistinguishable results for similar prompts, AToM handles the complexity of prompts and achieves significantly higher quality than ATT3D. ^ in the text denotes "a zoomed out DSLR photo of". One can also observe clear improvements of AToM over the original ATT3D by cross-referencing with their paper.
  • ...and 9 more figures
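
The Figure 3 caption says the 3D network works "from positions and triplane features". A minimal sketch of how a triplane is commonly queried at a 3D position follows: the point is projected onto three axis-aligned feature planes, each plane is bilinearly interpolated, and the three feature vectors are combined. This is an illustration of the general technique, not the authors' implementation. The function names, the `[-1, 1]` coordinate range, and the choice to sum (rather than concatenate) the three features are assumptions.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly interpolate plane[row][col] (a list of channels) at (u, v) in [-1, 1]."""
    res = len(plane)
    # Map [-1, 1] to continuous pixel coordinates in [0, res - 1], clamping out-of-range input.
    x = min(max((u + 1.0) * 0.5 * (res - 1), 0.0), res - 1)
    y = min(max((v + 1.0) * 0.5 * (res - 1), 0.0), res - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    tx, ty = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - tx) + plane[y0][x1][c] * tx
        bottom = plane[y1][x0][c] * (1 - tx) + plane[y1][x1][c] * tx
        out.append(top * (1 - ty) + bottom * ty)
    return out


def query_triplane(planes, point):
    """Project a 3D point onto the XY, XZ, and YZ planes and sum the sampled features.

    Summation is an assumption of this sketch; concatenation is another common choice.
    """
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


if __name__ == "__main__":
    # Toy 2x2 planes with one channel each (values chosen only for illustration).
    toy_planes = {
        "xy": [[[0.0], [1.0]], [[0.0], [1.0]]],  # feature increases along x
        "xz": [[[2.0], [2.0]], [[2.0], [2.0]]],  # constant plane
        "yz": [[[3.0], [3.0]], [[3.0], [3.0]]],  # constant plane
    }
    print(query_triplane(toy_planes, (0.0, 0.0, 0.0)))
```

In a full generator, the resulting feature vector would be passed to the 3D network, which per the caption produces SDF, vertex deformation, and color; that network is omitted here.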