Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, Yao Yao

TL;DR

Direct3D-S2 tackles the memory and compute barrier of high-resolution 3D generation by introducing Spatial Sparse Attention (SSA) to diffusion-transformer computations on sparse volumetric tokens. A fully end-to-end Sparse SDF VAE (SS-VAE) preserves sparse representations across input, latent, and output, while a Rectified Flow-based DiT leverages SSA for efficient, scalable 1024^3 generation. Key innovations include a three-module SSA (sparse 3D compression, spatial blockwise selection, sparse 3D window) with gating and a sparse conditioning mechanism to focus on foreground evidence, enabling high-quality gigascale 3D outputs on 8 GPUs. Empirical results show superior generation quality and substantial speedups over dense or naive attention, making 1024^3 3D generation practical on modest hardware and setting a new efficiency benchmark for explicit 3D diffusion methods.

Abstract

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024 resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256 resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.
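
To make the "gigascale" claim concrete, the short sketch below counts the cells of a dense voxel grid at the two resolutions mentioned in the abstract. The helper name `dense_voxels` is hypothetical and used only for illustration; the numbers describe dense grids, not the sparse token counts the method actually processes.

```python
def dense_voxels(resolution: int) -> int:
    """Number of cells in a dense cubic grid with the given side length."""
    return resolution ** 3


high = dense_voxels(1024)  # dense grid at 1024 resolution
low = dense_voxels(256)    # dense grid at 256 resolution

print(high)         # 1073741824 (about 1.07 billion cells)
print(low)          # 16777216
print(high // low)  # 64
```

A dense 1024 grid therefore holds 64 times as many cells as a dense 256 grid, which is why keeping only the sparse, occupied region matters at this resolution.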

Paper Structure

This paper contains 21 sections, 10 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: The framework of our Direct3D-S2. We propose a fully end-to-end sparse SDF VAE (SS-VAE), which employs a symmetric encoder-decoder network to efficiently encode high-resolution sparse SDF volumes into sparse latent representations $\mathbf{z}$. Then we train an image-conditioned diffusion transformer (SS-DiT) based on $\mathbf{z}$, and design a novel Spatial Sparse Attention (SSA) mechanism that substantially improves the training and inference efficiency of the DiT.
  • Figure 2: The framework of our Spatial Sparse Attention (SSA). We partition the input tokens into blocks based on their 3D coordinates, and then construct key-value pairs through three distinct modules. For each query token, the sparse 3D compression module captures global information, the spatial blockwise selection module selects important blocks based on the compression attention scores to extract fine-grained features, and the sparse 3D window module injects local features. Finally, we aggregate the outputs of the three modules into the SSA output using predicted gate scores.
  • Figure 3: Qualitative comparisons between other image-to-3D methods and our approach.
  • Figure 4: User Study for Image-to-3D Generation.
  • Figure 5: Qualitative comparisons of VAE reconstruction results. Note that we used a latent token length of 4096 for Dora [chen2024dora] at inference time.
  • ...and 7 more figures
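
The three-branch structure described in the Figure 2 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: mean pooling stands in for the compression module, the local window is taken to be the token's own block, the gates are a sigmoid of a linear projection of the query, and the names `ssa_sketch`, `gate_w`, `block_size`, and `top_k` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] is True where query i may see key j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v


def ssa_sketch(coords, q, k, v, gate_w, block_size=8, top_k=2):
    n, d = q.shape
    # Partition tokens into blocks based on their 3D coordinates.
    _, block_id = np.unique(coords // block_size, axis=0, return_inverse=True)
    block_id = block_id.reshape(-1)
    n_blocks = int(block_id.max()) + 1
    member = (np.arange(n_blocks)[:, None] == block_id[None, :]).astype(float)

    # Compression branch (assumption: mean-pool keys/values per block).
    counts = member.sum(axis=1, keepdims=True)
    k_cmp = member @ k / counts
    v_cmp = member @ v / counts
    cmp_scores = softmax(q @ k_cmp.T / np.sqrt(d))  # shape (n, n_blocks)
    out_cmp = cmp_scores @ v_cmp

    # Selection branch: each query attends to all tokens in its top-k blocks,
    # ranked by the compression attention scores.
    top_k = min(top_k, n_blocks)
    top_blocks = np.argsort(-cmp_scores, axis=1)[:, :top_k]
    sel_mask = (block_id[None, None, :] == top_blocks[:, :, None]).any(axis=1)
    out_sel = masked_attention(q, k, v, sel_mask)

    # Window branch (assumption: the window is the query's own block).
    win_mask = block_id[:, None] == block_id[None, :]
    out_win = masked_attention(q, k, v, win_mask)

    # Gated aggregation (assumption: sigmoid gates predicted from the query).
    gates = 1.0 / (1.0 + np.exp(-(q @ gate_w)))  # shape (n, 3)
    return gates[:, 0:1] * out_cmp + gates[:, 1:2] * out_sel + gates[:, 2:3] * out_win


rng = np.random.default_rng(0)
coords = rng.integers(0, 32, size=(64, 3))  # sparse voxel coordinates
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
out = ssa_sketch(coords, q, k, v, gate_w=rng.standard_normal((16, 3)))
print(out.shape)  # (64, 16)
```

The sketch builds dense n-by-n masks for readability, so it does not reproduce the efficiency gains the paper reports; it only shows how the three branches and the gates fit together.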