PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms

Yifei Xia, Shuchen Weng, Siqi Yang, Jingqi Liu, Chengxuan Zhu, Minggui Teng, Zijian Jia, Han Jiang, Boxin Shi

TL;DR

PanoWan addresses the challenge of generating high-quality 360° panoramic videos by lifting priors from a pre-trained text-to-video diffusion model to the panorama with minimal, efficient modules. It introduces latitude-aware sampling to mitigate latitudinal distortion, rotated semantic denoising to achieve seamless longitude transitions, and padded pixel-wise decoding to reduce boundary artifacts, all supported by the newly released PanoVid dataset. The approach achieves state-of-the-art panoramic generation metrics and robust zero-shot performance on downstream tasks, while enabling practical editing and long-video generation. This work narrows the gap between conventional priors and panoramic geometry, enabling scalable, coherent 360° content creation from text descriptions.

Abstract

Panoramic video generation enables immersive 360° content creation, valuable in applications that demand scene-consistent world exploration. However, existing panoramic video generation models struggle to leverage pre-trained generative priors from conventional text-to-video models for high-quality and diverse panoramic video generation, due to limited dataset scale and the gap in spatial feature representations. In this paper, we introduce PanoWan, which effectively lifts pre-trained text-to-video models to the panoramic domain with minimal additional modules. PanoWan employs latitude-aware sampling to avoid latitudinal distortion, while its rotated semantic denoising and padded pixel-wise decoding ensure seamless transitions at longitude boundaries. To provide sufficient panoramic videos for learning these lifted representations, we contribute PanoVid, a high-quality panoramic video dataset with captions and diverse scenarios. Consequently, PanoWan achieves state-of-the-art performance in panoramic video generation and demonstrates robustness for zero-shot downstream tasks. Our project page is available at https://panowan.variantconst.com.

Paper Structure

This paper contains 20 sections, 14 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: PanoWan is a text-based panoramic video generation framework. It lifts pre-trained generative priors from a conventional text-to-video model to the panorama, and enables generating diverse scenarios for long videos. Equipped with training-free techniques, PanoWan supports zero-shot editing of panoramic videos, including super-resolution, semantic editing, and video outpainting.
  • Figure 2: The pipeline of our proposed PanoWan, aware of spherical coordinates. To avoid latitudinal distortion, initial random Gaussian noise is remapped to align with the spherical frequency distribution using latitude-aware sampling (see the latitude section). Next, this remapped noise serves as the latent code within the VAE-encoded latent space. A DiT-based denoising network then iteratively refines this latent representation, where rotated semantic denoising is applied by rolling the latent grid to ensure semantic consistency across longitudinal boundaries. After that, padded pixel-wise decoding provides the VAE decoder with extended context, enabling the mapping of the denoised latent code back into seamless panoramic videos (see the longitude section). The DiT backbone within PanoWan is efficiently fine-tuned using LoRA, where most parameters of the pre-trained text-to-video model remain frozen to preserve its strong generative priors.
  • Figure 3: Visual comparison results with existing text-based panoramic video generation methods.
  • Figure 4: Qualitative evaluation of the proposed latitude/longitude-aware mechanisms. (a) With the proposed Latitude-Aware Sampling (LAS), PanoWan ensures that content generated at high latitudes exhibits accurate geometry when presented in a perspective view. (b) By combining Rotated Semantic Denoising (RSD) and Padded Pixel-wise Decoding (PPD), PanoWan achieves seamless longitude transitions. For visualization, videos are rolled 180° to center the seam.
  • Figure 5: Additional comparison results with existing text-based panoramic video generation methods.
  • ...and 4 more figures
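
The Figure 2 caption mentions two longitude-aware operations: rolling the latent grid, and giving the decoder extended context through padding. The following is a minimal sketch of those two ideas on a toy grid, not the authors' implementation. The helper names (`roll_longitude`, `pad_longitude`, `crop_longitude`) and the 3-tap blur used as a stand-in for a decoder are assumptions made purely for illustration; the real model operates on VAE latents and a DiT denoiser.

```python
def roll_longitude(grid, shift):
    """Circularly shift every row (one latitude line) by `shift` columns."""
    width = len(grid[0])
    shift %= width
    if shift == 0:
        return [row[:] for row in grid]
    return [row[-shift:] + row[:-shift] for row in grid]


def pad_longitude(grid, pad):
    """Wrap-around padding: copy `pad` columns from the opposite edge to each side."""
    if not 1 <= pad <= len(grid[0]):
        raise ValueError("pad must be between 1 and the grid width")
    return [row[-pad:] + row + row[:pad] for row in grid]


def crop_longitude(grid, pad):
    """Drop the `pad` extra columns on each side after decoding."""
    if pad < 1:
        raise ValueError("pad must be at least 1")
    return [row[pad:-pad] for row in grid]


def blur3(grid):
    """Hypothetical stand-in for a decoder: 3-tap horizontal mean, edges clamped."""
    out = []
    for row in grid:
        n = len(row)
        out.append([
            (row[max(i - 1, 0)] + row[i] + row[min(i + 1, n - 1)]) / 3
            for i in range(n)
        ])
    return out


# One latitude row with four longitude columns (toy values).
latent = [[0.0, 3.0, 6.0, 9.0]]

# Rolling is lossless: shifting by s and then by -s restores the grid.
restored = roll_longitude(roll_longitude(latent, 1), -1)

# Without padding, the edge columns only see their own side of the seam.
plain = blur3(latent)

# With wrap-around padding, each edge column also sees its true neighbour
# across the 0°/360° boundary; the extra columns are cropped afterwards.
padded = crop_longitude(blur3(pad_longitude(latent, 1)), 1)
```

In this toy case the interior columns are identical in `plain` and `padded`, and only the two seam columns differ. That is the effect the padding is meant to have: context from across the longitude boundary reaches the columns next to the seam.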