Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling
Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Yitao Liang, Weinan E, Linfeng Zhang, Guolin Ke
TL;DR
Uni-3DAR is a unified autoregressive framework for cross-scale 3D generation and understanding, built on an octree-based coarse-to-fine tokenizer. It combines two-level subtree compression with a masked next-token prediction strategy to represent diverse 3D data efficiently, from molecules to macroscopic shapes, so that a single model handles both generation and understanding. Across 3D small molecules, crystals, macroscopic objects, protein pockets, docking, and pretraining tasks, Uni-3DAR achieves state-of-the-art or competitive results while delivering substantial speedups over diffusion-based methods. These results show strong generalization and efficiency across scales, marking a step toward a general-purpose 3D foundation model for multi-domain scientific tasks.
Abstract
3D structure modeling is essential across scales, enabling applications from fluid simulation and 3D reconstruction to protein folding and molecular docking. Yet, despite shared 3D spatial patterns, current approaches remain fragmented, with models narrowly specialized for specific domains and unable to generalize across tasks or scales. We propose Uni-3DAR, a unified autoregressive framework for cross-scale 3D generation and understanding. At its core is a coarse-to-fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. We further propose a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next-token prediction strategy that ensures accurate positional modeling, significantly boosting model performance. Extensive experiments across multiple 3D generation and understanding tasks, including small molecules, proteins, polymers, crystals, and macroscopic 3D objects, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster.
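To make the idea of a coarse-to-fine octree tokenizer concrete, the following is a minimal sketch of a generic octree occupancy serialization, not the Uni-3DAR tokenizer itself. It assumes points lie in the unit cube and emits one 8-bit occupancy mask per non-empty cell, level by level. The function name `octree_tokens`, the bit layout, and the cell ordering are illustrative assumptions; the two-level subtree compression and the masked next-token prediction described above are not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def octree_tokens(points: Sequence[Point], depth: int) -> List[List[int]]:
    """Return one list of 8-bit occupancy tokens per octree level, coarse to fine.

    Assumption: all points lie in the unit cube [0, 1)^3.
    Bit k of a token is set when child octant k of that cell holds a point,
    with k = (x_bit << 2) | (y_bit << 1) | z_bit (a hypothetical layout).
    """
    # Each cell is keyed by its integer (x, y, z) index at the current level.
    cells: Dict[Cell, List[Point]] = {(0, 0, 0): list(points)}
    levels: List[List[int]] = []
    for level in range(depth):
        n = 2 ** (level + 1)  # number of cells per axis at the child level
        next_cells: Dict[Cell, List[Point]] = {}
        tokens: List[int] = []
        # Sorting the cell keys gives a deterministic order within a level.
        for (cx, cy, cz), pts in sorted(cells.items()):
            mask = 0
            for p in pts:
                # Child cell index of the point, clamped to stay inside the grid.
                ix = min(int(p[0] * n), n - 1)
                iy = min(int(p[1] * n), n - 1)
                iz = min(int(p[2] * n), n - 1)
                octant = ((ix - 2 * cx) << 2) | ((iy - 2 * cy) << 1) | (iz - 2 * cz)
                mask |= 1 << octant
                next_cells.setdefault((ix, iy, iz), []).append(p)
            tokens.append(mask)
        levels.append(tokens)
        cells = next_cells  # only non-empty children are subdivided further
    return levels


if __name__ == "__main__":
    pts = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
    per_level = octree_tokens(pts, depth=2)
    flat = [t for level in per_level for t in level]  # 1D token sequence
    print(per_level, flat)
```

Concatenating the per-level lists yields a 1D sequence in which early tokens describe coarse occupancy and later tokens refine it, which is the property an autoregressive model over such a sequence would rely on. Empty cells are never subdivided, so the sequence length tracks the occupied volume rather than the full grid.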
