LATTICE: Democratize High-Fidelity 3D Generation at Scale
Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Qingxiang Lin, Jingwei Huang, Chunchao Guo, Xiangyu Yue
TL;DR
The paper targets the fidelity-scalability gap in 3D asset generation with two contributions: VoxSet, a semi-structured latent representation anchored to a coarse voxel grid, and LATTICE, a two-stage pipeline that first generates a sparse voxelized geometry anchor and then refines detailed geometry with a rectified-flow transformer. VoxSet supports arbitrary-resolution decoding and token-level test-time scaling, while RoPE conditioning and progressive token growth improve convergence and detail. The approach reports state-of-the-art reconstruction and generation performance at low training cost, producing high-fidelity 3D content from a single image. Overall, LATTICE offers a practical path toward scalable, high-quality 3D asset generation for visual effects, gaming, and design pipelines.
Abstract
We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging because it must predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured, scalable 3D asset encoding schemes. To address this, we propose VoxSet, a semi-structured representation that compresses 3D assets into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient and position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, allowing positional embeddings to guide generation and enabling strong token-level test-time scaling. Built upon this representation, LATTICE adopts a two-stage pipeline: it first generates a sparse voxelized geometry anchor, then produces detailed geometry using a rectified-flow transformer. Our method is simple at its core but supports arbitrary-resolution decoding, low-cost training, and flexible inference schemes, achieving state-of-the-art performance in reconstruction and generation and offering a significant step toward scalable, high-quality 3D asset creation.
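To make the idea of latents "anchored to a coarse voxel grid" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes surface points normalized to [-1, 1]^3 and a hypothetical grid resolution; the function names `voxel_anchors` and `anchor_positions` are illustrative only. It shows just the anchoring step: each occupied coarse voxel yields one anchor position, which is where a learned latent vector and its positional embedding would attach. The learned encoder and decoder are not modeled here.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxel_anchors(points: List[Point], grid_res: int) -> Dict[Voxel, List[Point]]:
    """Group points (assumed to lie in [-1, 1]^3) by the coarse voxel they occupy."""
    cells: Dict[Voxel, List[Point]] = {}
    for p in points:
        # Map each coordinate from [-1, 1] to an integer index in [0, grid_res - 1].
        idx = tuple(
            min(grid_res - 1, max(0, int((c + 1.0) / 2.0 * grid_res))) for c in p
        )
        cells.setdefault(idx, []).append(p)
    return cells


def anchor_positions(cells: Dict[Voxel, List[Point]], grid_res: int) -> List[Point]:
    """Return one anchor (the voxel center) per occupied voxel, in sorted index order."""
    size = 2.0 / grid_res  # edge length of one voxel
    return [tuple(-1.0 + (i + 0.5) * size for i in idx) for idx in sorted(cells)]


# Hypothetical toy input: three surface points.
points = [(-0.9, -0.9, -0.9), (-0.6, -0.9, -0.9), (0.9, 0.9, 0.9)]

coarse = anchor_positions(voxel_anchors(points, grid_res=4), grid_res=4)
fine = anchor_positions(voxel_anchors(points, grid_res=8), grid_res=8)
```

In this toy setting, the coarse grid merges the two nearby points into one anchor while the finer grid separates them, so the token count grows with grid resolution. This is only a loose analogy for the token-level test-time scaling described in the abstract; how LATTICE actually adds tokens at inference is not specified in this section.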
