OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

Sensen Gao; Zhaoqing Wang; Qihang Cao; Dongdong Yu; Changhu Wang; Tongliang Liu; Mingming Gong; Jiawang Bian

Abstract

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE), which leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF), which mixes drifted and original representations to mitigate train-inference exposure bias and shape a robust 3D manifold. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.

Paper Structure (24 sections, 22 equations, 11 figures, 8 tables)

Introduction
Related Work
3D Scene Generation
Representation Autoencoder
Method
3D Unified Representation Autoencoder
Cross-view Correspondence
Manifold-Drift Forcing
Experiment
Training Details
Evaluation Protocols
3D Scene Generation
Ablation Study
Conclusion
Training hyperparameter settings
...and 9 more sections

Figures (11)

Figure 1: (a) OneWorld generates 3DGS from a single view and renders novel views. (b) Architecture: FlashWorld li2025flashworld diffuses in compressed video latents; Gen3R huang2026gen3r compresses 3D features to align a 3D foundation encoder with video latents but can only generate geometry and appearance separately, rather than jointly. OneWorld generates directly in a unified 3D representation space, without compression or separate generation. (c) Performance comparison on WorldScore duan2025worldscore and DL3DV ling2024dl3dv.
Figure 2: Overview of the proposed OneWorld framework. (a) We construct a unified 3D representation space by introducing appearance injection and semantic distillation. (b) During DiT peebles2023scalable training, we incorporate cross-view correspondence, preserving cross-view geometric token correspondences from the target view to the conditioned view. (c) Manifold-drift forcing: we augment the original 3D manifold by mixing ground-truth 3D features with sampled 3D features, enabling a more robust 3D decoder.
Figure 3: Visualization of the effects of appearance injection and semantic distillation. We visualize the reconstruction results with and without appearance injection and present feature visualizations with and without semantic distillation.
Figure 4: Qualitative visual results on one-view-based novel view generation.
Figure 5: Visualization of 3DGS and rendered novel views.
...and 6 more figures
