Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention

Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin

TL;DR

The paper tackles the inefficiency of two-stage diffusion-based 3D generation on sparse voxels by decoupling coarse layout from fine geometry. It introduces Ultra3D, which first uses the compact VecSet representation to generate a coarse object layout, then refines per-voxel latents with Part Attention, an attention mechanism restricted to semantically coherent parts. A scalable part-annotation pipeline converts raw meshes into part-labeled sparse voxels, and Part Attention achieves up to 6.7x speed-up in latent generation without sacrificing quality. Experiments demonstrate state-of-the-art fidelity at 1024 resolution and strong user preference, with substantial efficiency gains over prior methods.

Abstract

Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
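
The idea of restricting attention to part regions can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `part_attention` and `attention_pairs`, the single-head setup, and the explicit loop over groups are assumptions made for readability. Each token attends only to tokens that share its part label, which is equivalent to full attention under a block-structured mask.

```python
import numpy as np

def part_attention(q, k, v, part_ids):
    """Scaled dot-product attention restricted to tokens with the same part label.

    q, k, v: float arrays of shape (N, d); part_ids: int array of shape (N,).
    Hypothetical single-head sketch; a real model would batch this on GPU.
    """
    out = np.zeros_like(v)
    scale = 1.0 / np.sqrt(q.shape[-1])
    for pid in np.unique(part_ids):
        idx = np.nonzero(part_ids == pid)[0]      # tokens in this part group
        scores = (q[idx] @ k[idx].T) * scale      # only within-group pairs are scored
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

def attention_pairs(part_ids):
    """Number of query-key pairs scored: the sum of squared group sizes."""
    _, counts = np.unique(part_ids, return_counts=True)
    return int((counts ** 2).sum())
```

With N tokens split into K groups of equal size, the number of scored pairs falls from N^2 to N^2 / K. This counts attention pairs only; the up to 6.7x figure reported in the abstract is a measured speed-up in latent generation and is not derived from this sketch.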

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Image-to-3D Generation Results of Ultra3D. Ultra3D delivers high-quality 3D meshes with fine-grained geometric details while maintaining efficient generation. Please zoom in to view detailed geometry.
  • Figure 2: Experiments on different attention mechanisms. Each color denotes an attention group, within which attention is computed independently. All other settings remain unchanged, with only the attention mechanism being replaced. 3D Window Attention partitions the object space into 8 fixed regions by splitting at the center along each axis. This fixed partitioning often misaligns with semantic boundaries, leading to degraded quality and style inconsistencies.
  • Figure 3: Pipeline Overview. We introduce Ultra3D, an efficient and high-quality 3D generation framework that first generates a sparse voxel layout via VecSet and then refines it by generating per-voxel latents. The core of Ultra3D is Part Attention, an efficient localized attention mechanism that performs attention computation independently within each part group. In addition, when the input condition is an image, each part group performs cross attention only with the image tokens onto which its voxel tokens are projected.
  • Figure 4: Impact of Resolution on Generation Quality. We compare results under different configurations, where “512_64” denotes a mesh resolution of 512 and a sparse voxel resolution of 64. In previous works, to reduce computational cost in the second stage, the sparse voxels are typically downsampled by half before attention computation in the DiT and upsampled afterward (annotated as “Downsample” in the figure). As shown, both the mesh resolution and the sparse voxel resolution used during attention computation significantly impact the final quality. Prior methods were limited to lower resolutions by efficiency constraints. In contrast, our efficient framework supports higher sparse voxel resolutions, making high-quality generation feasible.
  • Figure 5: Robustness of Part Annotation. Although our method is trained using data with exactly 8 part groups, we find it to be robust to variations in part annotation. Varying the number of part groups has little impact on generation quality, suggesting that increasing the number of annotated part groups can further accelerate computation without compromising performance.
  • ...and 2 more figures
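
The 3D Window Attention baseline described in the Figure 2 caption splits the object space at the center along each axis, giving 8 fixed regions. A minimal sketch of that grouping follows, assuming integer voxel coordinates in [0, resolution); the function name `window_groups` and the bit-packing order of the group id are illustrative choices, not taken from the paper.

```python
import numpy as np

def window_groups(coords, resolution):
    """Assign each voxel to one of 8 octants by splitting at the grid center.

    coords: int array of shape (N, 3) with values in [0, resolution).
    Returns an int array of shape (N,) with group ids in [0, 8).
    """
    half = resolution // 2
    bits = (coords >= half).astype(np.int64)   # 0 or 1 per axis
    # Pack the three per-axis bits into a single octant id (x is the high bit).
    return bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]
```

These group ids depend only on voxel position, so a single semantic part that crosses the center plane ends up in more than one attention group. This is the misalignment with semantic boundaries that the caption attributes to fixed partitioning.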