JoPano: Unified Panorama Generation via Joint Modeling

Wancheng Feng, Chen An, Zhenliang He, Meina Kan, Shiguang Shan, Lukun Wang

TL;DR

JoPano addresses two challenges in panorama generation, quality and efficiency, by unifying text-to-panorama (T2P) and view-to-panorama (V2P) generation under a single diffusion framework. It introduces a Joint-Face Adapter that enables cross-face modeling across the six cubemap faces, augmented with 3D spherical RoPE and adapter-only optimization to preserve pretrained capabilities. A condition-switching scheme unifies T2P and V2P within one model, Poisson-based cross-face blending reduces boundary artifacts, and new seam metrics quantify them. Experimental results on Structured3D and SUN360 show state-of-the-art performance on FID, CLIP-FID, IS, and CLIP-Score for both tasks, with strong qualitative results and seam consistency. The approach also supports stylized outputs and multi-text prompts.

Abstract

Panorama generation has recently attracted growing interest in the research community, with two core tasks, text-to-panorama and view-to-panorama generation. However, existing methods still face two major challenges: their U-Net-based architectures constrain the visual quality of the generated panoramas, and they usually treat the two core tasks independently, which leads to modeling redundancy and inefficiency. To overcome these challenges, we propose a joint-face panorama (JoPano) generation approach that unifies the two core tasks within a DiT-based model. To transfer the rich generative capabilities of existing DiT backbones learned from natural images to the panorama domain, we propose a Joint-Face Adapter built on the cubemap representation of panoramas, which enables a pretrained DiT to jointly model and generate different views of a panorama. We further apply Poisson Blending to reduce seam inconsistencies that often appear at the boundaries between cube faces. Correspondingly, we introduce Seam-SSIM and Seam-Sobel metrics to quantitatively evaluate the seam consistency. Moreover, we propose a condition switching mechanism that unifies text-to-panorama and view-to-panorama tasks within a single model. Comprehensive experiments show that JoPano can generate high-quality panoramas for both text-to-panorama and view-to-panorama generation tasks, achieving state-of-the-art performance on FID, CLIP-FID, IS, and CLIP-Score metrics.
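To make the cubemap representation concrete, the sketch below maps each pixel of a cube face to a unit viewing direction and then to longitude and latitude on the sphere, which is the standard geometric link between six cube faces and an equirectangular panorama. The face orientation table `FACES` and the function names are illustrative assumptions, not the convention used by the authors; the abstract does not specify one.

```python
import numpy as np

# Assumed (hypothetical) face convention: each face is described by
# (forward, right, up) unit vectors in a y-up, z-forward frame.
FACES = {
    "front": ((0, 0, 1), (1, 0, 0), (0, 1, 0)),
    "right": ((1, 0, 0), (0, 0, -1), (0, 1, 0)),
    "back": ((0, 0, -1), (-1, 0, 0), (0, 1, 0)),
    "left": ((-1, 0, 0), (0, 0, 1), (0, 1, 0)),
    "up": ((0, 1, 0), (1, 0, 0), (0, 0, -1)),
    "down": ((0, -1, 0), (1, 0, 0), (0, 0, 1)),
}


def face_directions(face, size):
    """Return a (size, size, 3) array of unit directions for one cube face."""
    forward, right, up = (np.asarray(v, dtype=np.float64) for v in FACES[face])
    # Pixel centres mapped to [-1, 1] on the face plane.
    t = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    # u grows to the right; v is flipped so that row 0 is the top of the face.
    u, v = np.meshgrid(t, -t)
    d = forward + u[..., None] * right + v[..., None] * up
    return d / np.linalg.norm(d, axis=-1, keepdims=True)


def to_lon_lat(directions):
    """Convert unit directions to (longitude, latitude) in radians."""
    lon = np.arctan2(directions[..., 0], directions[..., 2])
    lat = np.arcsin(np.clip(directions[..., 1], -1.0, 1.0))
    return lon, lat
```

Under this assumed convention, adjacent faces share the same directions along their common edge, which is why independently generated faces can show visible seams unless they are modeled jointly or blended afterwards.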

Paper Structure

This paper contains 40 sections, 15 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: We propose JoPano, a unified panorama generation framework that supports both text-to-panorama (T2P) and view-to-panorama (V2P). The left eight examples show T2P results, while the right eight show V2P results. JoPano generates high-quality panoramas across indoor, outdoor, and stylized scenes.
  • Figure 2: Overview of the JoPano pipeline. (a) Training process. The Joint-Face Adapter is inserted into Sana-DiT to jointly model all six cubemap faces, and a single diffusion process is shared by T2P and V2P. (b) Inference process. The Joint-Face DiT generates the cubemap faces, and the Cross-Face Blender further refines the results across faces.
  • Figure 3: Comparison of JoPano with other T2P methods. The first row shows outdoor scenes, the second row shows indoor scenes, and the third row shows a stylized scene, all generated from text prompts.
  • Figure 4: Comparison of JoPano with other V2P methods. The first row shows outdoor scenes, the second row shows indoor scenes, and the third row shows a stylized scene, all generated from view conditions.
  • Figure 5: Comparison of ERP panoramas with and without Cross-Face Blending (CFB). The first row shows the panorama before CFB, with visible seam artifacts (red box). The second row shows the panorama after applying CFB, where the seams are smoothed. The improvement is more apparent in the gradient fields.
  • ...and 7 more figures
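
The Figure 5 caption notes that seam artifacts show up most clearly in gradient fields. A minimal sketch of that idea follows: it compares the intensity jump across the shared edge of two horizontally adjacent faces with the typical pixel-to-pixel step inside the faces. This is an illustrative measure only; it is not the paper's Seam-SSIM or Seam-Sobel definition, which this summary does not give, and the function names `seam_jump` and `seam_ratio` are hypothetical.

```python
import numpy as np


def seam_jump(left_face, right_face):
    """Mean absolute intensity jump across the shared vertical edge."""
    a = np.asarray(left_face, dtype=np.float64)
    b = np.asarray(right_face, dtype=np.float64)
    # Last column of the left face meets the first column of the right face.
    return float(np.mean(np.abs(b[:, 0] - a[:, -1])))


def seam_ratio(left_face, right_face, eps=1e-8):
    """Seam jump divided by the mean horizontal step inside both faces.

    Values near 1 mean the seam looks like ordinary image content;
    values well above 1 suggest a visible discontinuity.
    """
    a = np.asarray(left_face, dtype=np.float64)
    b = np.asarray(right_face, dtype=np.float64)
    steps = np.concatenate([
        np.abs(np.diff(a, axis=1)).ravel(),
        np.abs(np.diff(b, axis=1)).ravel(),
    ])
    return seam_jump(a, b) / (float(steps.mean()) + eps)
```

For example, splitting a smooth horizontal ramp into two halves gives a ratio close to 1, while adding a constant offset to one half raises the ratio in proportion to the offset.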