DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, Lu Qi

TL;DR

DiT360 introduces a hybrid training framework for panoramic image generation that jointly leverages limited panoramic data and abundant perspective images. By combining image-level regularization (panoramic refinement and perspective guidance) with token-level supervision (circular padding, yaw loss, cube loss), it achieves improved boundary continuity and image fidelity. Extensive experiments across text-to-panorama, inpainting, and outpainting demonstrate state-of-the-art performance on Matterport3D and strong qualitative results, with built-in capabilities for high-resolution generation. This approach provides a practical path toward photorealistic, geometrically faithful panoramas for AR/VR and 3D scene applications.

Abstract

In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism in generated panoramas mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 comprises several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules: circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics. Our code is available at https://github.com/Insta360-Research-Team/DiT360.
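
The two token-level ideas that concern the left/right seam can be illustrated on a toy grid. The sketch below is not the DiT360 implementation: it assumes a 2D grid stored as nested Python lists, and the helper names `circular_pad_width` and `yaw_shift` are hypothetical. It only shows the underlying geometry: an equirectangular panorama wraps horizontally, so padding can copy columns from the opposite edge, and a yaw rotation (about the vertical axis) is a circular shift of columns. How the paper turns these operations into position-aware padding and a yaw loss is not specified here.

```python
from typing import List, Sequence

Grid = List[List[float]]


def circular_pad_width(grid: Sequence[Sequence[float]], pad: int) -> Grid:
    """Pad each row by wrapping `pad` columns around from the opposite edge."""
    if pad < 0:
        raise ValueError("pad must be non-negative")
    padded: Grid = []
    for row in grid:
        cells = list(row)
        width = len(cells)
        if pad > width:
            raise ValueError("pad must not exceed the row width")
        left = cells[width - pad:]   # last `pad` columns go on the left
        right = cells[:pad]          # first `pad` columns go on the right
        padded.append(left + cells + right)
    return padded


def yaw_shift(grid: Sequence[Sequence[float]], shift: int) -> Grid:
    """Circularly shift columns to the right by `shift` (a yaw rotation in ERP)."""
    shifted: Grid = []
    for row in grid:
        cells = list(row)
        k = shift % len(cells) if cells else 0
        shifted.append(cells[-k:] + cells[:-k] if k else cells)
    return shifted


if __name__ == "__main__":
    toy = [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(circular_pad_width(toy, 1))
    print(yaw_shift(toy, 1))
```

Because the shift is circular, shifting by the full width returns the original grid, which is the property a rotation-consistency objective can rely on.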

Paper Structure

This paper contains 34 sections, 10 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Visualization of DiT360's results. The shown examples include text-to-panorama generation, inpainting, and outpainting, together with comparisons against existing methods.
  • Figure 2: Overview of the DiT360 hybrid training pipeline. For the perspective branch, we employ (a) perspective image re-projection to transfer perspective knowledge to panoramic domain. For the panoramic branch, we first apply (b) panoramic refinement to remove polar blurring and then introduce (c) position-aware circular padding, (d) rotation-consistent yaw loss and (e) distortion-aware cube loss for token-level hybrid supervision.
  • Figure 3: Panoramic image refinement pipeline. The ERP panorama is converted into a cubemap, where pre-defined masks are applied to the central regions of the top and bottom faces. These masked regions are then reconstructed with an inpainting model and reprojected to ERP. In the figure, orange boxes represent blurry areas, and red dashed boxes indicate inpainted cubes.
  • Figure 4: Qualitative comparisons on panorama generation. The representative artifacts are highlighted with red boxes. More complete results are provided in the appendix.
  • Figure 5: Ablation results of different settings. Artifacts are marked by color-coded bounding boxes: red for spurious details, yellow for boundary discontinuities, and green for incorrect distortions.
  • ...and 5 more figures
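
The ERP-to-cubemap conversion mentioned in the Figure 3 caption can be sketched as a coordinate mapping. This is a minimal illustration under stated assumptions, not the authors' code: the axis convention (x right, y up, z forward), the face orientations, and the function names `cube_face_direction` and `direction_to_erp` are all hypothetical choices, and real implementations differ in these conventions. The sketch shows why the centers of the top and bottom faces correspond to the ERP poles, which is where the captions for Figures 2 and 3 locate the blur that refinement targets.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cube_face_direction(face: str, u: float, v: float) -> Vec3:
    """Direction for face coordinates u, v in [-1, 1] (assumed convention)."""
    directions = {
        "front": (u, v, 1.0),
        "back": (-u, v, -1.0),
        "right": (1.0, v, -u),
        "left": (-1.0, v, u),
        "top": (u, 1.0, -v),
        "bottom": (u, -1.0, v),
    }
    if face not in directions:
        raise ValueError(f"unknown face: {face}")
    return directions[face]


def direction_to_erp(d: Vec3, width: int, height: int) -> Tuple[float, float]:
    """Map a 3D direction to continuous ERP (column, row) coordinates."""
    x, y, z = d
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)        # longitude in [-pi, pi], 0 at the front
    lat = math.asin(y / norm)     # latitude in [-pi/2, pi/2], positive is up
    col = (lon / (2.0 * math.pi) + 0.5) * width
    row = (0.5 - lat / math.pi) * height
    return col, row


if __name__ == "__main__":
    for face in ("front", "right", "top", "bottom"):
        print(face, direction_to_erp(cube_face_direction(face, 0.0, 0.0), 2048, 1024))
```

Sampling the ERP image at these coordinates for every pixel of a face yields that cube face. Longitude is undefined exactly at the poles, so only the row is meaningful for the top and bottom face centers.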