Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han

Abstract

Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
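
To make the shared-sequence idea concrete, the following is a minimal sketch of how one interleaved X-to-X training sequence could be assembled for a semantic-visual-geometric cycle (text to image to 3D to image). All names here (the special boundary tokens, the function) are illustrative assumptions, not the paper's actual interface.

    # Minimal sketch (assumed interface, not the released code): one
    # text -> image -> 3D -> image cycle flattened into a single causal
    # token stream for next-token-prediction training.

    # Hypothetical modality-boundary tokens in the shared vocabulary.
    BOS_TEXT, BOS_IMG, BOS_3D, EOS = 32000, 32001, 32002, 32003

    def build_cycle_sequence(text_tokens, image_tokens, shape_tokens, reimage_tokens):
        """All arguments are lists of discrete token ids from the respective
        tokenizers; reimage_tokens is a rendered view of the 3D output that
        closes the cycle."""
        return ([BOS_TEXT] + text_tokens
                + [BOS_IMG] + image_tokens
                + [BOS_3D] + shape_tokens
                + [BOS_IMG] + reimage_tokens + [EOS])

Because every span is predicted from everything before it, a single sequence jointly supervises semantic alignment (text to image), lifting (image to 3D), and multi-view consistency (3D back to image); heterogeneous pairs simply omit the spans they lack, which is why fully aligned text-image-3D triplets are not required.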

Paper Structure

This paper contains 49 sections, 3 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: Comparison of Text-to-3D paradigms
  • Figure 2: Omni123 enables native 3D generation and editing within a unified multimodal framework trained with limited 3D data. Top: Text-to-joint-2D-and-3D generation results across diverse prompts. Each example shows the generated 2D image (inset) alongside the normal map of the corresponding 3D output, demonstrating high geometric fidelity and strong semantic alignment with the input text. Bottom: Native 3D editing of 3D models through sequential and branching text instructions (e.g., "+wearing a kimono," "+riding a skateboard"), illustrating the model's ability to perform iterative and diverse 3D modifications directly in 3D space. More qualitative results are shown in Figures \ref{fig:showcase_1} and \ref{fig:showcase_2}.
  • Figure 3: Overview of the Omni123 architecture. Text is encoded by dual text encoders (CLIP [radford2021learning] and Qwen3-0.6B [yang2025qwen3]) and fed into a conditioning stream, while images and 3D shapes are tokenized into 1D discrete tokens and concatenated into a unified generation stream. The unified autoregressive transformer backbone uses 24 dual-stream blocks to jointly process the conditioning and generation tokens under causal attention, followed by 6 single-stream layers operating only on generation tokens, and finally modality-specific linear heads that decode token logits over the 2D and 3D codebooks. (This layout is sketched in code after the figure list.)
  • Figure 4: Two-stage image tokenizer training strategy. Stage 1 trains a continuous VAE (DINO-Tok) to learn high-fidelity visual representations. Stage 2 freezes the VAE and trains a 1D Q-Former to reconstruct the continuous features, reducing vector quantization to a compact 1D token extraction task. (A training-loop sketch follows the figure list.)
  • Figure 5: Continued training introduces view-conditioned generation via learnable view tokens. Top left: $N{=}6$ view tokens corresponding to canonical viewpoints (front, back, left, right, top, bottom) are mapped from 3D camera extrinsics. Top right: A view token $\mathbf{v}$ is prepended to the target image tokens and fed into the self-attention stream, enabling viewpoint-controllable generation. Bottom: In the Posed Image $\to$ Posed Image task, the source image is accompanied by its own view token $\mathbf{v}_{\text{src}}$, providing explicit source-target viewpoint correspondence for novel view synthesis. (The conditioning interface is sketched in code after the figure list.)
  • ...and 4 more figures
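
The backbone layout described in the Figure 3 caption can be summarized in a short PyTorch sketch. This is a schematic under stated assumptions, not the released implementation: the paper's dual-stream blocks may carry separate parameters per stream, whereas here a shared block simply attends over the concatenated conditioning and generation tokens. The block counts (24 and 6) come from the caption; widths, head counts, and codebook sizes are placeholders.

    import torch
    import torch.nn as nn

    class Omni123Sketch(nn.Module):
        """Schematic of Figure 3: 24 dual-stream blocks over conditioning +
        generation tokens, 6 single-stream blocks over generation tokens only,
        then per-modality linear heads over the 2D and 3D codebooks."""
        def __init__(self, d=1024, n_heads=16, vocab_2d=16384, vocab_3d=16384):
            super().__init__()
            def block():
                return nn.TransformerEncoderLayer(
                    d, n_heads, dim_feedforward=4 * d,
                    batch_first=True, norm_first=True)
            self.dual_stream = nn.ModuleList([block() for _ in range(24)])
            self.single_stream = nn.ModuleList([block() for _ in range(6)])
            self.head_2d = nn.Linear(d, vocab_2d)  # logits over the 2D codebook
            self.head_3d = nn.Linear(d, vocab_3d)  # logits over the 3D codebook

        @staticmethod
        def causal_mask(L):
            # Additive causal mask: -inf strictly above the diagonal.
            return torch.triu(torch.full((L, L), float('-inf')), diagonal=1)

        def forward(self, cond, gen):
            # cond: (B, Lc, d) conditioning features (dual text encoders)
            # gen:  (B, Lg, d) embedded image + 3D generation tokens
            x = torch.cat([cond, gen], dim=1)
            mask = self.causal_mask(x.size(1))
            for blk in self.dual_stream:        # joint causal attention
                x = blk(x, src_mask=mask)
            g = x[:, cond.size(1):]             # keep generation tokens only
            mask_g = self.causal_mask(g.size(1))
            for blk in self.single_stream:
                g = blk(g, src_mask=mask_g)
            return self.head_2d(g), self.head_3d(g)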
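
The two-stage tokenizer training in the Figure 4 caption can likewise be sketched as a training loop. The module interfaces below are assumptions for illustration (a vae returning a reconstruction and continuous features, a qformer returning reconstructed features plus 1D token ids, a batch sampler); the paper's actual losses likely include perceptual and KL terms beyond the plain MSE shown here.

    import torch
    import torch.nn.functional as F

    def train_two_stage(vae, qformer, sample_batch, steps=1000, lr=1e-4):
        # Stage 1: train the continuous VAE (DINO-Tok) for high-fidelity
        # visual representations.
        opt = torch.optim.AdamW(vae.parameters(), lr=lr)
        for _ in range(steps):
            x = sample_batch()
            recon, feats = vae(x)               # assumed interface
            loss = F.mse_loss(recon, x)         # stand-in reconstruction loss
            opt.zero_grad(); loss.backward(); opt.step()

        # Stage 2: freeze the VAE; train the 1D Q-Former to reconstruct the
        # frozen continuous features, so vector quantization reduces to a
        # compact 1D token extraction task.
        for p in vae.parameters():
            p.requires_grad_(False)
        opt = torch.optim.AdamW(qformer.parameters(), lr=lr)
        for _ in range(steps):
            x = sample_batch()
            with torch.no_grad():
                _, feats = vae(x)               # frozen continuous features
            feats_hat, token_ids = qformer(feats)   # compact 1D discrete tokens
            loss = F.mse_loss(feats_hat, feats)
            opt.zero_grad(); loss.backward(); opt.step()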
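
Finally, the view-conditioning interface from the Figure 5 caption amounts to prepending learnable view embeddings to image-token sequences. The sketch below is an assumed reading of that interface; the mapping from 3D camera extrinsics to one of the six canonical tokens is abstracted into a name lookup.

    import torch
    import torch.nn as nn

    CANONICAL_VIEWS = ["front", "back", "left", "right", "top", "bottom"]  # N = 6

    class ViewTokens(nn.Module):
        """One learnable embedding per canonical viewpoint."""
        def __init__(self, d=1024):
            super().__init__()
            self.embed = nn.Embedding(len(CANONICAL_VIEWS), d)

        def forward(self, view_name):
            idx = torch.tensor([CANONICAL_VIEWS.index(view_name)])
            return self.embed(idx)              # shape (1, d)

    def condition_on_view(vt, tgt_view, tgt_img_emb, src_view=None, src_img_emb=None):
        """Prepend view tokens to image-token embeddings of shape (1, L, d).
        For the Posed Image -> Posed Image task, the source image carries its
        own view token, giving explicit source-target correspondence."""
        parts = []
        if src_img_emb is not None:
            parts += [vt(src_view).unsqueeze(0), src_img_emb]
        parts += [vt(tgt_view).unsqueeze(0), tgt_img_emb]
        return torch.cat(parts, dim=1)          # fed to the self-attention stream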