360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model

Qian Wang; Weiqi Li; Chong Mou; Xinhua Cheng; Jian Zhang

TL;DR

The paper addresses the challenge of generating high-quality 360° panorama videos without expensive capture by introducing 360DVD, a diffusion-based pipeline that adapts pre-trained text-to-video (T2V) models through a lightweight 360-Adapter. The model is trained on panorama-specific content from a new WEB360 dataset, captioned with a 360 Text Fusion strategy, and uses enhancement techniques such as a latitude-aware loss and wraparound-consistency mechanisms to improve continuity and motion fidelity. The results show that 360DVD produces text-aligned, coherent panorama videos across multiple styles and can follow motion guidance from optical flow while keeping a content distribution consistent with real panoramas. This enables prompt-driven panorama video generation for VR, entertainment, and educational applications, while preserving the priors learned by large diffusion models and allowing easy adaptation to personalized text-to-image (T2I) models.

Abstract

Panorama video has recently attracted growing interest in both research and applications, owing to its immersive experience. Because capturing 360-degree panoramic videos is expensive, generating desirable panorama videos from prompts is in high demand. Emerging text-to-video (T2V) diffusion methods have proven notably effective for standard video generation. However, because of the significant gap in content and motion patterns between panoramic and standard videos, these methods struggle to yield satisfactory 360-degree panoramic videos. In this paper, we propose a pipeline named 360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic videos from given prompts and motion conditions. Specifically, we introduce a lightweight 360-Adapter, accompanied by 360 Enhancement Techniques, to transform pre-trained T2V models for panorama video generation. We further propose a new panorama dataset named WEB360, consisting of panoramic video-text pairs for training 360DVD, which addresses the absence of captioned panoramic video datasets. Extensive experiments demonstrate the superiority and effectiveness of 360DVD for panorama video generation. Our project page is at https://akaneqwq.github.io/360DVD/.

Paper Structure (17 sections, 10 equations, 8 figures, 1 table)

Introduction
Related Works
Text-to-Image Diffusion Model
Text-to-Video Diffusion Model
Panorama Generation
Method
Preliminaries
WEB360 Dataset
360-degree Video Diffusion Model
360 Enhancement Techniques
Experiment
Implementation Details
Qualitative Results
Comparison
Ablation Study
...and 2 more sections

Figures (8)

Figure 1: Main results. Our 360DVD creates text-aligned, coherent, and high-quality $360^{\circ}$ panorama videos. Furthermore, 360DVD can cooperate with multiple personalized text-to-image models and consistently generate stylized panorama videos.
Figure 2: 360 Text Fusion. The captions of four images, each with a field of view (FoV) of $90^{\circ}$, are fed into ChatGPT to generate a new $360^{\circ}$ summarization. Compared to the caption of the equirectangular projection (ERP) at the bottom right, 360 Text Fusion allows for more fine-grained captions.
Figure 3: Overview of 360DVD. 360DVD leverages a trainable 360-Adapter to extend standard T2V models to the panorama domain and is able to generate high-quality panorama videos with given prompts and optional motion conditions. In addition, 360 Enhancement Techniques are proposed for quality improvement in the panorama perspective.
Figure 4: Overview of 360-Adapter. 360-Adapter is a simple but effective module in which intermediate features are fed into the U-Net encoder blocks for modulation.
Figure 5: Left: visualization of the weight matrix $\mathbf{W}$; brighter colors indicate values closer to 1, while darker colors indicate values closer to 0. Right: a schematic diagram of the latent rotation mechanism. In each iteration, the leftmost portion spanning angle $\theta$ is shifted to the far right.
...and 3 more figures
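
The two ideas in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `rotate_latent` and `latitude_weights` are hypothetical, as is the array layout, which puts longitude on the last axis. The caption only states that the values of $\mathbf{W}$ lie between 0 and 1, so the cosine-of-latitude form used here is an assumption, chosen because it is a common weighting for equirectangular images.

```python
import numpy as np


def rotate_latent(latent: np.ndarray, theta_deg: float) -> np.ndarray:
    """Move the leftmost `theta_deg` degrees of a panorama latent to the far right.

    Assumes the last axis of `latent` spans the full 360 degrees of longitude.
    """
    width = latent.shape[-1]
    shift = int(round(width * (theta_deg % 360.0) / 360.0))
    # A negative roll moves the leading columns to the end of the axis.
    return np.roll(latent, -shift, axis=-1)


def latitude_weights(height: int, width: int) -> np.ndarray:
    """Hypothetical weight map: cosine of latitude, close to 1 at the equator
    and close to 0 at the poles (an assumed form, not taken from the source)."""
    # Map row centers to latitudes strictly inside (-pi/2, pi/2).
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    return np.repeat(np.cos(lat)[:, None], width, axis=1)
```

Because the shift is circular, calling `rotate_latent` four times with `theta_deg=90` returns the original array, which is the property that lets the left and right borders of the panorama trade places across iterations.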
