Unified Cross-Scale 3D Generation and Understanding via Autoregressive Modeling

Shuqi Lu, Haowei Lin, Lin Yao, Zhifeng Gao, Xiaohong Ji, Yitao Liang, Weinan E, Linfeng Zhang, Guolin Ke

TL;DR

Uni-3DAR introduces a unified autoregressive framework for cross-scale 3D generation and understanding grounded in an octree-based coarse-to-fine tokenizer. It combines a 2-level subtree compression and a masked next-token prediction strategy to efficiently represent diverse 3D data from molecules to macroscopic shapes, enabling generation and understanding within a single model. Across 3D small molecules, crystals, macroscopic objects, protein pockets, docking, and pretraining tasks, Uni-3DAR achieves state-of-the-art or competitive results while delivering substantial speedups over diffusion-based methods. This cross-scale foundation model demonstrates strong generalization and efficiency, marking a step toward a general-purpose 3D foundation model for multi-domain scientific tasks.

Abstract

3D structure modeling is essential across scales, enabling applications from fluid simulation and 3D reconstruction to protein folding and molecular docking. Yet, despite shared 3D spatial patterns, current approaches remain fragmented, with models narrowly specialized for specific domains and unable to generalize across tasks or scales. We propose Uni-3DAR, a unified autoregressive framework for cross-scale 3D generation and understanding. At its core is a coarse-to-fine tokenizer based on octree data structures, which compresses diverse 3D structures into compact 1D token sequences. We further propose a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. To address the challenge of dynamically varying token positions introduced by compression, we introduce a masked next-token prediction strategy that ensures accurate positional modeling, significantly boosting model performance. Extensive experiments across multiple 3D generation and understanding tasks, including small molecules, proteins, polymers, crystals, and macroscopic 3D objects, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster.
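To make the tokenizer idea concrete, the following is a minimal sketch of coarse-to-fine octree tokenization followed by a two-level subtree merge. It is an illustration under stated assumptions, not the authors' implementation: the function names (`octree_tokens`, `compress_subtrees`), the breadth-first ordering, the child ordering within a parent, and the convention of counting levels from the root are all hypothetical choices made here. It also covers only occupancy tokens and leaves out the fine-grained tokens (e.g., atom types and coordinates).

```python
from itertools import product

def child_cells(cell):
    """Eight children of an integer cell (x, y, z) at the next finer level."""
    x, y, z = cell
    return [(2 * x + dx, 2 * y + dy, 2 * z + dz)
            for dx, dy, dz in product((0, 1), repeat=3)]

def octree_tokens(points, depth):
    """Breadth-first binary occupancy tokens for points in [0, 1)^3.

    Returns (level, cell, bit) triples; level 0 is the root here
    (an assumption of this sketch, not the paper's numbering).
    """
    n = 2 ** depth
    occupied = [set() for _ in range(depth + 1)]
    occupied[depth] = {tuple(int(c * n) for c in p) for p in points}
    # Propagate occupancy upward: a cell is non-empty if any child is.
    for level in range(depth, 0, -1):
        occupied[level - 1] = {(x // 2, y // 2, z // 2)
                               for x, y, z in occupied[level]}
    tokens = []
    for level in range(depth):
        # Only non-empty cells are subdivided, which keeps the sequence sparse.
        for parent in sorted(occupied[level]):
            for child in child_cells(parent):
                tokens.append((level + 1, child, int(child in occupied[level + 1])))
    return tokens

def compress_subtrees(tokens):
    """Merge each group of 8 sibling bits into one token with a code in 1..255."""
    merged = []
    for i in range(0, len(tokens), 8):
        group = tokens[i:i + 8]
        level, first_child, _ = group[0]
        parent = tuple(c // 2 for c in first_child)
        code = sum(bit << k for k, (_, _, bit) in enumerate(group))
        merged.append((level - 1, parent, code))
    return merged

points = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
raw = octree_tokens(points, depth=2)
packed = compress_subtrees(raw)
print(len(raw), len(packed))  # 24 3
```

In this toy case the merge turns 24 binary tokens into 3 tokens, matching the 8x reduction stated above. The cost is a larger vocabulary: each merged token takes one of 255 values, because a subdivided parent always has at least one non-empty child.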

Paper Structure

This paper contains 58 sections, 8 figures, 12 tables, 1 algorithm.

Figures (8)

  • Figure 1: Uni-3DAR Overview. (a) A coarse-to-fine octree-based tokenizer converts 3D structures into 1D sequences (details in Figure 2). The tokens are modeled by an autoregressive transformer trained with masked next-token prediction (details in Figure 3) and can be optionally conditioned on cross-modal inputs (e.g., text, biological sequences, spectra). A single model supports single- and multi-frame generation as well as token- and structure-level understanding. (b) An example of an octree from coarse level to fine level. Uni-3DAR generates tokens in a coarse-to-fine order: high-level occupancy tokens followed by level-0 tokens that capture local details (e.g., atom types and coordinates). The merits of the octree over other 3D representations are discussed in the appendix.
  • Figure 2: Overview of Uni-3DAR tokenization (illustrated in 2D using quadtree for clarity). (a) Adaptive coarse-to-fine subdivision of grid cells, where darker nodes indicate non-empty cells that can be further partitioned. (b) This partitioning process constructs an octree, providing a lossless compression of the full-size 3D grid. (c) Uni-3DAR’s tokenization consists of two components: hierarchical spatial compression via an octree and fine-grained structural tokenization. Each node's position is determined by its tree level and cell center. (d) The proposed 2-level subtree compression reduces the octree tokens by 8x (4x in the illustrated quadtree).
  • Figure 3: (a) Masked Next-Token Prediction. To handle the challenge of dynamically positioned tokens in sparse 3D structures, Uni-3DAR decouples position and content generation. Unlike standard next-token prediction, we first infer the next token's position from the octree hierarchy, place a "[MASK]" token, and then have the model predict only its content (e.g., occupancy or fine-grained properties). (b) Unified Framework for 3D Generation and Understanding. The Uni-3DAR architecture is a versatile, multi-task model. It supports autoregressive generation of complex 3D structures (blue arrows) and can be prompted to perform both token-level (green arrows) and structure-level (blue box) understanding tasks within a single framework.
  • Figure 4: Left: Uni-3DAR generation speed on different batch sizes compared with the diffusion-based method; Right: Uni-3DAR generation speed on different rank ratios $r$ compared with the diffusion-based method (higher is better).
  • Figure SI-1: Unconditional 3D molecular generation samples of QM9 dataset.
  • ...and 3 more figures
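
The decoupling of position and content described in the Figure 3(a) caption can be sketched as a decoding loop in which the next token's position is read off the tree and only its content is predicted. This is a hedged illustration, not the paper's code: `generate`, `predict_content`, the `(level, cell, content)` token layout, the breadth-first frontier, and the root-based level numbering are all assumptions introduced here, and a lookup table stands in for the transformer.

```python
from collections import deque
from itertools import product

OFFSETS = list(product((0, 1), repeat=3))  # fixed child order within a parent

def generate(predict_content, depth):
    """Coarse-to-fine decoding: positions come from the tree, not the model."""
    sequence = []                       # (level, cell, code) tokens emitted so far
    frontier = deque([(0, (0, 0, 0))])  # non-empty cells still to be expanded
    leaves = []                         # occupied cells at the finest level
    while frontier:
        level, cell = frontier.popleft()
        # The position is known before prediction, so a [MASK] is placed there.
        mask = (level, cell, "[MASK]")
        code = predict_content(sequence, mask)  # the model fills in content only
        sequence.append((level, cell, code))
        for k, (dx, dy, dz) in enumerate(OFFSETS):
            if (code >> k) & 1:  # child k is non-empty
                child = (2 * cell[0] + dx, 2 * cell[1] + dy, 2 * cell[2] + dz)
                if level + 1 == depth:
                    leaves.append(child)
                else:
                    frontier.append((level + 1, child))
    return sequence, leaves

# Stand-in for a trained model: replays known 8-bit occupancy codes by position.
TABLE = {(0, (0, 0, 0)): 129, (1, (0, 0, 0)): 1, (1, (1, 1, 1)): 128}

def replay(context, mask):
    level, cell, _ = mask
    return TABLE[(level, cell)]

seq, leaves = generate(replay, depth=2)
print(leaves)  # [(0, 0, 0), (3, 3, 3)]
```

Because each token's position depends on which earlier cells turned out to be non-empty, positions vary from sample to sample. Placing the mask at the known position before predicting means the model is never asked to guess where the next token belongs.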