LATTICE: Democratize High-Fidelity 3D Generation at Scale

Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Qingxiang Lin, Jingwei Huang, Chunchao Guo, Xiangyu Yue

TL;DR

The paper targets the fidelity-scalability gap in 3D asset generation by introducing VoxSet, a semi-structured latent representation anchored to a coarse voxel grid, and the LATTICE framework, a two-stage pipeline that first seeds sparse geometry and then refines detailed geometry with a rectified-flow transformer. VoxSet enables arbitrary-resolution decoding and strong test-time scaling, while RoPE conditioning and progressive token growth improve convergence and detail. The approach achieves state-of-the-art reconstruction and generation performance with low training cost and demonstrates robust test-time token scaling, enabling scalable, high-fidelity 3D content from a single image. Overall, LATTICE offers a practical path toward scalable, high-quality 3D asset generation for visual effects, gaming, and design pipelines.
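As a rough illustration of what "anchored to a coarse voxel grid" could mean in practice, the sketch below quantizes surface points into a coarse grid and returns the occupied cells and their centers, which could serve as anchor positions for latent tokens. The function name `voxel_anchors`, the `[-1, 1]^3` normalization, and the default resolution are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def voxel_anchors(points, resolution=16):
    """Return occupied voxel indices and their centers for points in [-1, 1]^3.

    Hypothetical helper: the normalization range and resolution are
    illustrative assumptions, not values from the paper.
    """
    points = np.asarray(points, dtype=np.float64)
    # Map [-1, 1] onto [0, resolution) and clamp so that +1.0 lands in the last cell.
    idx = np.floor((points + 1.0) * 0.5 * resolution).astype(np.int64)
    idx = np.clip(idx, 0, resolution - 1)
    # Keep each occupied cell once; rows come back in lexicographic order.
    occupied = np.unique(idx, axis=0)
    # Convert integer cell indices back to cell-center coordinates.
    cell_size = 2.0 / resolution
    centers = -1.0 + (occupied + 0.5) * cell_size
    return occupied, centers
```

Each row of `centers` is a 3D position that a positional embedding such as RoPE could consume; how the actual model assigns and encodes these positions is not specified in this summary.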

Abstract

We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging due to the need to predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured and scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, allowing positional embeddings to guide generation and enabling strong token-level test-time scaling. Built upon this representation, LATTICE adopts a two-stage pipeline: first generating a sparse voxelized geometry anchor, then producing detailed geometry using a rectified flow transformer. Our method is simple at its core, but supports arbitrary resolution decoding, low-cost training, and flexible inference schemes, achieving state-of-the-art performance on various aspects, and offering a significant step toward scalable, high-quality 3D asset creation.
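To make the rectified-flow step more concrete, here is a minimal sampling sketch. It assumes the common convention in which noise sits at t = 0, data at t = 1, and a network predicts the velocity along straight paths; the paper may use a different time direction or solver. `euler_sample` and `toy_velocity` are hypothetical names, and `toy_velocity` is a closed-form stand-in for a trained transformer.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest t used is 1 - dt, so (1 - t) below never reaches zero
        x = x + dt * velocity(x, t)
    return x

def toy_velocity(target):
    """Ideal velocity for straight paths x_t = (1 - t) * x0 + t * target.

    Hypothetical stand-in for a learned velocity network.
    """
    target = np.asarray(target, dtype=np.float64)
    return lambda x, t: (target - x) / (1.0 - t)
```

With this toy field the trajectories are exactly straight, so Euler integration lands on the target for any step count. A learned field is only approximately straight, which is why the number of steps becomes a quality and cost trade-off at inference time.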

Paper Structure

This paper contains 20 sections, 14 figures, and 3 tables.

Figures (14)

  • Figure 1: High quality 3D assets generated by LATTICE from a single image.
  • Figure 2: Illustration of test-time scaling in our model. The model is trained with up to 6,144 tokens, but is evaluated under different token counts at test time, showing notable improvements.
  • Figure 3: LATTICE system: At its core is a novel VoxSet representation, enabling scalable 3D modeling from 0.6B to 4.5B parameters.
  • Figure 4: Illustrations of different latent representations and different query types.
  • Figure 5: LATTICE Model Architecture: it features a two-stage coarse-to-fine pipeline and a novel VoxSet VAE and DiT.
  • ...and 9 more figures