HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels

HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo

TL;DR

HunyuanWorld 1.0 generates immersive, explorable 3D worlds from text or images by combining 360° panoramic world proxies with a semantically layered 3D mesh representation. The framework introduces Panorama-DiT panorama generation, agentic world layering, layer-wise depth-aligned reconstruction, and Voyager-based long-range extension, along with system-level optimizations for efficiency. It reports state-of-the-art performance in both panorama and 3D world generation and supports applications in VR, physical simulation, and interactive content creation. The work bridges 2D panoramic priors and 3D scene representations, offering a scalable path toward interactive, exportable 3D worlds from natural language and imagery.

Abstract

Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.

Paper Structure

This paper contains 15 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: An overview of HunyuanWorld 1.0 applications.
  • Figure 2: An overview of HunyuanWorld 1.0 architecture for 3D world generation. Given a conditioned scene image or textual description, HunyuanWorld 1.0 generates layer-wise 3D worlds in mesh through a staged generative framework. We first leverage a diffusion model (Panorama-DiT) to generate a panoramic image, which serves as an initial world proxy for providing full 360° scene information. We then obtain semantically layered scene representations via world layering and reconstruction. To ensure layer-wise alignment of the reconstructed 3D world, we enhance the panoramic depth estimation model with a cross-layer depth alignment strategy. Also, users can obtain full 3D objects via image-to-3D generation or represent the sky as HDRI maps for downstream applications.
  • Figure 3: An overview of our panoramic data curation pipeline.
  • Figure 4: Visual results of image-to-panorama generation by HunyuanWorld 1.0.
  • Figure 5: Visual results of text-to-panorama generation by HunyuanWorld 1.0.
  • ...and 9 more figures
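
The Figure 2 caption describes using a panoramic image plus estimated depth as a 360° world proxy that is then reconstructed as a mesh. The following is a minimal sketch of that general idea, not the authors' implementation: it lifts an equirectangular depth map to 3D points and connects neighbouring pixels into triangles. The function names, the axis convention (y up, longitude 0 looking along +z), and the treatment of depth as radial distance from the camera are all assumptions made for illustration.

```python
import math


def equirect_pixel_to_direction(u, v, width, height):
    """Unit ray direction for the centre of pixel (u, v).

    Assumed convention: longitude runs from -pi to pi left to right,
    latitude runs from +pi/2 (top) to -pi/2 (bottom), and y points up.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def unproject_panorama(depth):
    """Lift a depth map (list of rows of radial distances) to 3D points.

    Returns a flat, row-major list of (x, y, z) tuples.
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            dx, dy, dz = equirect_pixel_to_direction(u, v, width, height)
            points.append((d * dx, d * dy, d * dz))
    return points


def grid_triangles(width, height):
    """Triangle index triples joining neighbouring pixels.

    Columns wrap around horizontally so the left and right edges of the
    panorama are stitched together; the poles are left open.
    """
    triangles = []
    for v in range(height - 1):
        for u in range(width):
            u_next = (u + 1) % width
            top_left = v * width + u
            top_right = v * width + u_next
            bottom_left = (v + 1) * width + u
            bottom_right = (v + 1) * width + u_next
            triangles.append((top_left, bottom_left, top_right))
            triangles.append((top_right, bottom_left, bottom_right))
    return triangles


if __name__ == "__main__":
    # A toy 8x4 panorama where every pixel is 2 units away: a sphere.
    toy_depth = [[2.0] * 8 for _ in range(4)]
    vertices = unproject_panorama(toy_depth)
    faces = grid_triangles(8, 4)
    print(len(vertices), "vertices,", len(faces), "triangles")
```

In a layered setting, one such mesh would presumably be built per semantic layer (for example sky, background, and foreground objects), with the per-layer depths aligned beforehand so the layers do not intersect; that alignment step is not shown here.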