OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

TL;DR

OmniX addresses the challenge of building graphics-ready 3D scenes from panoramas by reusing pre-trained 2D flow-matching priors for unified panoramic generation, perception, and completion. It introduces a cross-modal Separate-Adapter design and the PanoX synthetic panorama dataset to enable RGB-to-X panoramas and intrinsic decomposition, culminating in a pipeline that converts distance maps into PBR-ready 3D assets for rendering, relighting, and dynamics. The main contributions are (1) OmniX with a unified formulation and adapter architecture, (2) the PanoX dataset with dense geometry and material maps, and (3) demonstrated panoramic perception and graphics-ready 3D scene generation across multiple tasks, validated on diverse datasets. The work enables immersive, photorealistic virtual environments and outlines practical integration with graphics pipelines, while acknowledging limitations in speed, surface accuracy for distance and metallic prediction, and generalization in some material channels, suggesting avenues for future improvement.

Abstract

There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

Paper Structure

This paper contains 21 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: We present OmniX, a versatile and unified framework that repurposes pre-trained 2D flow matching models for panoramic perception, generation, and completion. This framework enables the construction of immersive, photorealistic, and graphics-compatible 3D scenes, suitable for physically based rendering (PBR), relighting, and physical dynamics simulation.
  • Figure 2: A preview of the proposed PanoX dataset, providing high-quality panoramic rendered images with rich pixel-aligned annotations, including distance, world normal, albedo, roughness, and metallic. The dataset is collected from both indoor and outdoor scenes.
  • Figure 3: OmniX pipeline for panoramic generation and perception. Built on a pre-trained 2D flow matching model with flexible, modality-specific adapters, OmniX is capable of performing a wide range of panoramic vision tasks including generation, perception, and completion.
  • Figure 4: Different cross-modal adapter structures for multiple condition inputs $\{\mathbf{c}^i~|~i=0, 1, ...\}$ and multiple target outputs $\{\mathbf{\hat{z}}_1^j~|~j=0, 1, ...\}$. Specifically, (a) Shared-Branch concatenates different inputs along the channel dimension; (b) Shared-Adapter is equivalent to token-wise concatenation; (c) Separate-Adapter learns specific adapter weights for each type of input.
  • Figure 5: Occlusion-aware mask sampling. Based on the panoramic distance map and a randomly sampled 3D displacement, we can estimate the occluded regions by ray intersection. These regions are used as masks for training panoramic completion and guided panoramic perception models.
  • ...and 6 more figures
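
The occlusion-aware mask sampling described in the Figure 5 caption can be illustrated with a small NumPy sketch. This is one possible reading of the caption, not the authors' implementation: it assumes the mask marks panorama pixels whose surface points are hidden from a viewpoint shifted by the sampled displacement, and it approximates ray intersection by sampling points along each line of sight and comparing them against the distance map. The function names, the axis convention, `num_steps`, and `tolerance` are all hypothetical choices made for illustration.

```python
import numpy as np


def equirect_directions(height, width):
    """Unit ray direction for every pixel center of an equirectangular image.

    Assumed convention (not taken from the paper): longitude runs from -pi to
    pi left to right, latitude from +pi/2 to -pi/2 top to bottom, y is up.
    """
    lon = ((np.arange(width) + 0.5) / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (H, W)
    return np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )


def points_to_pixels(points, radius, height, width):
    """Map 3D points to the (row, col) of the panorama pixel they project to."""
    lon = np.arctan2(points[..., 0], points[..., 2])
    lat = np.arcsin(np.clip(points[..., 1] / np.maximum(radius, 1e-12), -1.0, 1.0))
    col = np.floor((lon / (2.0 * np.pi) + 0.5) * width).astype(int) % width
    row = np.floor((0.5 - lat / np.pi) * height).astype(int)
    return np.clip(row, 0, height - 1), col


def occlusion_mask(distance, displacement, num_steps=64, tolerance=0.05):
    """Boolean (H, W) mask of pixels hidden from the displaced viewpoint.

    distance:     (H, W) panoramic distance map measured from the origin.
    displacement: 3-vector, assumed to lie in free space of the scene.
    """
    height, width = distance.shape
    surface = equirect_directions(height, width) * distance[..., None]
    viewpoint = np.asarray(displacement, dtype=np.float64)
    mask = np.zeros((height, width), dtype=bool)
    # Walk from the displaced viewpoint toward each surface point, skipping
    # both endpoints of the segment.
    for s in np.linspace(0.0, 1.0, num_steps, endpoint=False)[1:]:
        sample = viewpoint + s * (surface - viewpoint)
        radius = np.linalg.norm(sample, axis=-1)
        row, col = points_to_pixels(sample, radius, height, width)
        # A sample lying beyond the observed surface along its own direction
        # means the line of sight passes behind known geometry.
        mask |= radius > distance[row, col] * (1.0 + tolerance)
    return mask
```

Calling `occlusion_mask(distance, displacement)` with a randomly sampled `displacement` yields a mask with the same resolution as the distance map. In this sketch a convex room produces an empty mask, while a nearby object hides part of the wall behind it. The fixed step count trades accuracy for speed, so thin occluders may be missed unless `num_steps` is increased.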