iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation

Hanxiao Wang, Biao Zhang, Weize Quan, Dong-Ming Yan, Peter Wonka

TL;DR

iFlame tackles the quadratic cost of full attention in autoregressive mesh generation by interleaving full and linear attention layers within an hourglass transformer. The design preserves expressive power while improving both training and inference efficiency: it achieves up to 1.8x inference throughput and substantial KV cache reductions, with comparable mesh quality on ShapeNet and Objaverse for meshes of up to 4,000 faces. The gains come from the linear component, which has $O(nd^2)$ training and $O(d^2)$ inference complexity (with $n$ the sequence length and $d$ the feature dimension). The method uses a mesh-specific autoregressive representation with coordinate quantization and a multi-scale hourglass pipeline to capture geometry across scales, coupled with a cache-efficient inference strategy and selective token processing. The results indicate that high-resolution unconditional mesh generation can be practical on modest hardware, enabling scalable generation of complex 3D meshes.

Abstract

This paper proposes iFlame, a novel transformer-based network architecture for mesh generation. While attention-based models have demonstrated remarkable performance in mesh generation, their quadratic computational complexity limits scalability, particularly for high-resolution 3D data. Conversely, linear attention mechanisms offer lower computational costs but often struggle to capture long-range dependencies, resulting in suboptimal outcomes. To address this trade-off, we propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms. To further enhance efficiency and leverage the inherent structure of mesh representations, we integrate this interleaving approach into an hourglass architecture. Our approach reduces training time while achieving performance comparable to pure attention-based models. To improve inference efficiency, we implement a caching algorithm that almost doubles the speed and reduces the KV cache size by seven-eighths compared to the original Transformer. We evaluate our framework on ShapeNet and Objaverse, demonstrating its ability to generate high-quality 3D meshes efficiently. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance, making it a practical solution for mesh generation. Training takes only 2 days on 4 GPUs with 39k Objaverse samples of at most 4k faces each.
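To make the efficiency argument concrete, the following is a minimal sketch of generic kernelized causal linear attention in plain Python. It is not the authors' implementation, and the specific linear attention variant used in iFlame is not stated in this summary. The feature map `phi` (elu + 1) and all function names are assumptions chosen for illustration. The point is that the recurrent form keeps only a fixed $d \times d$ state, so each generated token costs $O(d^2)$ regardless of how many tokens came before, whereas a full-attention layer must re-read a cache that grows with the sequence.

```python
import math

def phi(x):
    # Assumed feature map: elu(x) + 1, which is strictly positive.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention_step(state, norm, q, k, v):
    """Advance causal linear attention by one token.

    `state` (d x d_v) and `norm` (d) are updated in place; their size
    does not depend on the number of tokens seen so far.
    """
    fk, fq = phi(k), phi(q)
    for i, fki in enumerate(fk):
        norm[i] += fki                      # z += phi(k)
        for j, vj in enumerate(v):
            state[i][j] += fki * vj         # S += phi(k) v^T
    denom = sum(a * b for a, b in zip(fq, norm))
    return [sum(fq[i] * state[i][j] for i in range(len(fq))) / denom
            for j in range(len(v))]

def run_recurrent(qs, ks, vs):
    """Process a whole sequence with the constant-size recurrent state."""
    d, dv = len(ks[0]), len(vs[0])
    state = [[0.0] * dv for _ in range(d)]
    norm = [0.0] * d
    return [linear_attention_step(state, norm, q, k, v)
            for q, k, v in zip(qs, ks, vs)]

def quadratic_reference(qs, ks, vs):
    """Same outputs, computed by re-reading every past token (O(n^2 d))."""
    outs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(ks[s]))) for s in range(t + 1)]
        total = sum(w)
        outs.append([sum(w[s] * vs[s][j] for s in range(t + 1)) / total
                     for j in range(len(vs[0]))])
    return outs
```

Both functions produce the same outputs up to floating-point error; only the recurrent one avoids storing past keys and values. Interleaving such layers with full-attention layers, as the abstract describes, trades some of this saving for the ability to attend exactly over long ranges.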

Paper Structure

This paper contains 22 sections, 4 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: Performance comparison of our iFlame architecture. (a) Our model achieves 1.8$\times$ higher inference throughput (81.9 t/s vs. 45.0 t/s). (b) Our model maintains low KV cache usage (0.8GB) while full attention requires 8.3$\times$ more memory when generating 4000 faces. (c, d, e) Our model reduces training time by 46% (227 min vs. 422 min), requires 38% less GPU memory during training (28GB vs. 45GB per GPU), and maintains face accuracy (78.1% vs. 78.3%) compared to baseline methods on ShapeNet with 2B tokens.
  • Figure 1: More generative results on Objaverse.
  • Figure 2: The architecture of iFlame.
  • Figure 2: More generative results on Objaverse.
  • Figure 3: Comparison of 3D mesh generation quality between MeshGPT (197M parameters) and our model iFlame (76M parameters) on ShapeNet.
  • ...and 4 more figures
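
The ratios quoted in the first Figure 1 caption can be checked against the raw numbers it reports. The short Python sketch below does only that arithmetic. The function and key names are hypothetical, and the full-attention KV cache size is a derived estimate from the rounded 0.8GB and 8.3$\times$ figures, not a number stated in the caption.

```python
def figure1_ratios():
    """Recompute the headline ratios from the numbers in the Figure 1 caption."""
    throughput_ours, throughput_full = 81.9, 45.0   # tokens per second
    train_min_ours, train_min_full = 227, 422       # training minutes
    mem_gb_ours, mem_gb_full = 28, 45               # training memory per GPU (GB)
    kv_gb_ours, kv_ratio = 0.8, 8.3                 # KV cache at 4000 faces

    return {
        "throughput_speedup": throughput_ours / throughput_full,
        "training_time_saved": 1 - train_min_ours / train_min_full,
        "training_memory_saved": 1 - mem_gb_ours / mem_gb_full,
        # Derived estimate only: the caption gives the ratio, not this value.
        "kv_gb_full_estimate": kv_gb_ours * kv_ratio,
        # A seven-eighths cache reduction (from the abstract) leaves 1/8,
        # i.e. an implied ratio of 8, close to the 8.3x in the caption.
        "kv_ratio_from_abstract": 1 / (1 - 7 / 8),
    }

if __name__ == "__main__":
    for name, value in figure1_ratios().items():
        print(f"{name}: {value:.3f}")
```

The recomputed values agree with the caption's rounded claims: roughly 1.8$\times$ throughput, 46% less training time, and 38% less training memory.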