SphereDiff: Tuning-free 360° Static and Dynamic Panorama Generation via Spherical Latent Representation

Minho Park, Taewoong Kang, Jooyeol Yun, Sungwon Hwang, Jaegul Choo

TL;DR

SphereDiff tackles the distortions inherent in 360° panorama generation by introducing a spherical latent representation distributed via a Fibonacci lattice, ensuring uniform coverage and consistent quality across all view directions, including the poles. By extending MultiDiffusion to operate in this spherical latent space, applying dynamic latent sampling to map latents to perspective views, and employing distortion-aware weighted averaging and multi-prompt inference, SphereDiff achieves tuning-free generation of high-quality static and live 360° wallpapers. It demonstrates state-of-the-art performance on panoramic criteria (distortion and end continuity) while remaining compatible with different diffusion backbones, highlighting its potential as a robust foundation for immersive AR/VR content. Noted limitations include indoor scene generation, limited panoramic video data, runtime, and dependence on the diffusion backbone; stronger backbones and broader datasets are suggested as future improvements.
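To make the Fibonacci lattice concrete, here is a minimal sketch of the standard golden-angle construction, which places n points so that each covers roughly the same area of the sphere. The function name `fibonacci_lattice` and the exact offset convention are assumptions for illustration; the summary does not state which variant of the lattice SphereDiff uses.

```python
import math


def fibonacci_lattice(n):
    """Return n roughly uniformly spread unit vectors (x, y, z) on the sphere.

    Hypothetical helper: a common golden-angle variant, not necessarily
    the exact construction used by SphereDiff.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # Uniform spacing in z gives equal-area latitude bands.
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        # Rotating by the golden angle avoids points lining up in longitude.
        phi = golden_angle * i
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points
```

Because the points are evenly spaced in z, a polar cap receives the same share of points as its share of the sphere's area. An ERP grid, by contrast, spends an entire row of samples on each pole.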

Abstract

The increasing demand for AR/VR applications has highlighted the need for high-quality content, such as 360° live wallpapers. However, generating high-quality 360° panoramic content remains a challenging task due to the severe distortions introduced by equirectangular projection (ERP). Existing approaches either fine-tune pretrained diffusion models on limited ERP datasets or adopt tuning-free methods that still rely on ERP latent representations, often resulting in distracting distortions near the poles. In this paper, we introduce SphereDiff, a novel approach for synthesizing 360° static and live wallpapers with state-of-the-art diffusion models without additional tuning. We define a spherical latent representation that ensures consistent quality across all perspectives, including near the poles. Then, we extend MultiDiffusion to the spherical latent representation and propose a dynamic spherical latent sampling method to enable direct use of pretrained diffusion models. Moreover, we introduce distortion-aware weighted averaging to further improve the generation quality. Our method outperforms existing approaches in generating 360° static and live wallpapers, making it a robust solution for immersive AR/VR applications. The code is available at https://github.com/pmh9960/SphereDiff
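The fusion step can be illustrated with a small sketch of MultiDiffusion-style weighted averaging over a shared set of latents. Each view contributes predictions for the latents it covers, and overlapping predictions are combined by a normalized weighted sum. The names `fuse_views` and `center_weight`, the scalar latent values, and the cosine falloff are all assumptions for illustration; the abstract does not specify the actual distortion-aware weighting.

```python
import math


def center_weight(angle, half_fov):
    """Hypothetical weight: 1 at the view center, falling to 0 at the edge.

    This cosine falloff is an assumed stand-in, not the paper's weighting.
    """
    return max(0.0, math.cos(0.5 * math.pi * angle / half_fov))


def fuse_views(num_latents, views):
    """Weighted average of per-view predictions over shared latents.

    views: list of (indices, values, weights) triples of equal length,
    where indices refer to positions in the shared latent set.
    Latents that no view covers are returned as None.
    """
    numerator = [0.0] * num_latents
    denominator = [0.0] * num_latents
    for indices, values, weights in views:
        for i, v, w in zip(indices, values, weights):
            numerator[i] += w * v
            denominator[i] += w
    return [n / d if d > 0 else None for n, d in zip(numerator, denominator)]
```

In this sketch, a latent seen near the center of one view and near the edge of another is dominated by the first view's prediction, which is the intuition behind down-weighting the more distorted regions.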

Paper Structure

This paper contains 65 sections, 14 equations, 17 figures, 5 tables, 1 algorithm.

Figures (17)

  • Figure 1: SphereDiff enables tuning-free 360° panorama generation via spherical latent. It is compatible with various diffusion backbones, including FLUX [flux2024], SANA [xie2024sana], and HunyuanVideo [kong2024hunyuanvideo].
  • Figure 2: Motivation. Both ERP-based finetuning [360_lora, wang2024360dvd] and tuning-free [liu2024dynamicscaler] approaches often fail to generate seamless scenes near the poles, as their latents are unevenly distributed over the spherical surface. In contrast, our method produces seamless results by leveraging a spherical latent representation.
  • Figure 3: Overall Pipeline. We begin by initializing uniformly distributed spherical latents. Next, we map these latents to perspective latents corresponding to multiple view directions. Each view is then denoised using its corresponding prompt. The denoised views are subsequently fused via distortion-aware weighted averaging.
  • Figure 4: Comparison of Nearest and Dynamic Sampling. Nearest sampling often resamples the selected latents or omits central ones, while dynamic sampling selects latents from the center outward, discarding only the outermost ones.
  • Figure 5: Qualitative comparison. Each sample shows perspective views from top to bottom, highlighting end-to-end continuity and distortion. Other methods exhibit artifacts such as seams, pole distortions, blurriness, or spots, while ours produces seamless, high-quality panoramas without these issues. The entire ERPs are available in the appendix.
  • ...and 12 more figures
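
The center-outward idea in the Figure 4 caption can be sketched as follows: rank the spherical points by angular distance from the view direction and keep the k closest, so each point is picked at most once and only the outermost candidates are dropped. This is a simplified illustration under stated assumptions. The function name `select_center_outward` is hypothetical, and the sketch does not reproduce how SphereDiff assigns the selected latents to the 2D perspective grid.

```python
import math


def select_center_outward(points, view_dir, k):
    """Return indices of the k points nearest in angle to view_dir.

    points and view_dir are unit vectors (x, y, z). Results are ordered
    from the view center outward. Hypothetical helper for illustration.
    """
    def angle(p):
        dot = sum(a * b for a, b in zip(p, view_dir))
        # Clamp to guard against rounding slightly outside [-1, 1].
        return math.acos(max(-1.0, min(1.0, dot)))

    order = sorted(range(len(points)), key=lambda i: angle(points[i]))
    return order[:k]
```

Compared with snapping each grid cell to its nearest point, this selection cannot pick the same latent twice and cannot skip a point near the center, which are the two failure modes the caption attributes to nearest sampling.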