Matrix-3D: Omnidirectional Explorable 3D World Generation

Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, Eric Li, Yang Liu, Yikai Wang, Hao-Xiang Guo, Yahui Zhou

TL;DR

Matrix-3D tackles omnidirectional explorable 3D world generation from image or text prompts by leveraging panoramic representations and trajectory-guided diffusion. It introduces a mesh-render conditioned panorama video diffusion model and two complementary 3D lifting pipelines—an optimization-based method for high-fidelity geometry and a fast feed-forward model for scalable reconstruction—along with the Matrix-Pano synthetic dataset. The approach achieves state-of-the-art results in panoramic video generation and 3D reconstruction, delivering improved visual quality, camera controllability, and geometric consistency for wide-coverage 3D worlds. This work advances spatial intelligence by enabling robust, large-scale 3D world modeling from minimal inputs and providing a rich dataset for training and evaluation.

Abstract

Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes panoramic representations for wide-coverage omnidirectional explorable 3D world generation and combines conditional video generation with panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as its condition, enabling high-quality and geometrically consistent scene video generation. To lift the panoramic scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more at https://matrix-3d.github.io.

Paper Structure

This paper contains 30 sections, 3 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Matrix-3D can generate an omnidirectional explorable 3D world from image or text input.
  • Figure 2: Comparison between perspective and panoramic images. Panoramic images can capture a significantly wider field of view than perspective images.
  • Figure 3: Core components of our framework. Given trajectory guidance in the form of scene mesh renderings and corresponding masks, obtained by rendering an estimated mesh along a user-defined camera trajectory, we train an image-to-video diffusion model to generate high-quality panoramic videos that precisely follow the specified trajectory. The generated 2D panoramic content is then lifted into an omnidirectional, explorable 3D world using a large-scale panorama reconstruction model.
  • Figure 4: Comparison of trajectory guidance derived from mesh and point cloud representations. Results guided by point clouds suffer from noticeable artifacts, which degrade generation quality.
  • Figure 5: Dataset illustration. We present a scene from the dataset and the data collection process. Two points are first randomly sampled on the route and connected by their shortest path, shown as the red line segments. Then, a Laplacian smoothing algorithm is applied to the initial path to obtain the smooth green path, along which panorama videos and depth maps are recorded together with their camera poses. We also show the captured RGB image and depth map of three different frames at the bottom of the figure.
  • ...and 7 more figures
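
The Figure 5 caption mentions applying Laplacian smoothing to an initial shortest path before recording the camera trajectory. The snippet below is a minimal sketch of that general technique on a 2D polyline, not the authors' implementation: the function name `laplacian_smooth_path`, the step size `lam`, the iteration count, and the zigzag test path are all assumptions made for illustration, since the summary does not state the actual settings.

```python
import numpy as np


def laplacian_smooth_path(points, lam=0.5, iterations=10):
    """Smooth an open polyline with iterative Laplacian smoothing.

    Hypothetical helper: `lam` and `iterations` are illustrative values,
    not parameters taken from the paper.
    """
    path = np.asarray(points, dtype=float).copy()
    for _ in range(iterations):
        # Discrete Laplacian at each interior point:
        # average of its two neighbours minus the point itself.
        laplacian = 0.5 * (path[:-2] + path[2:]) - path[1:-1]
        # Move interior points toward their neighbour average.
        # The two endpoints are never touched, so start and goal stay fixed.
        path[1:-1] += lam * laplacian
    return path


def segment_energy(path):
    """Sum of squared segment lengths; lower means a smoother, straighter path."""
    return float(np.sum(np.diff(path, axis=0) ** 2))


# A jagged stand-in for a shortest path between two sampled points.
raw = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 0.0]])
smoothed = laplacian_smooth_path(raw, lam=0.5, iterations=10)
```

Keeping the endpoints fixed preserves the two sampled route points while the interior of the path relaxes toward a smoother curve; a larger `iterations` value pulls the path closer to a straight line between them.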