VideoPanda: Video Panoramic Diffusion with Multi-view Attention

Kevin Xie, Amirmojtaba Sabour, Jiahui Huang, Despoina Paschalidou, Greg Klar, Umar Iqbal, Sanja Fidler, Xiaohui Zeng

TL;DR

VideoPanda tackles the generation of high-resolution $360^\circ$ panoramic videos conditioned on text or single-view video by extending a pretrained video diffusion model with 3D multi-view self-attention and ray-direction embeddings. The approach trains in two stages with randomized view-frame configurations (the "random matrix" strategy) and supports multi-task conditioning (text, video, autoregressive) plus a shifted $v$-prediction objective, enabling autoregressive long-video generation. Across real and synthetic datasets, VideoPanda achieves superior realism, temporal coherence, and prompt alignment compared to baselines such as 360DVD and MVDiffusion, as measured by FID/FVD, PSNR/SSIM/LPIPS, CLIP scores, and user studies. The method's ability to synthesize consistent multi-view videos and stitch them into immersive $360^\circ$ content offers a scalable pathway for VR content creation, with future directions including dynamic scene understanding, more accurate estimation of conditioning parameters, and extension to stronger base video models.

Abstract

High-resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360$^\circ$ videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly using two conditions, text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model is able to gracefully generalize to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360$^\circ$ panoramas across all input conditions compared to existing methods. Visit the project website at https://research.nvidia.com/labs/toronto-ai/VideoPanda/ for results.
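
The random subsampling of duration and camera views mentioned in the abstract can be illustrated with a minimal sketch. The function name `sample_view_frame_config`, the `budget` parameter, and all default values are hypothetical choices made for illustration; the abstract does not specify how views and frames are drawn, so this shows only one plausible way to keep the number of view-frame cells per training sample bounded.

```python
import random


def sample_view_frame_config(num_views=8, num_frames=16, budget=24, rng=None):
    """Pick a random subset of views and a contiguous window of frames.

    All defaults are illustrative assumptions, not values from the paper.
    The product (views kept) * (frames kept) never exceeds `budget`.
    """
    rng = rng or random.Random()
    # Choose how many views to keep: at least one, at most all of them.
    v = rng.randint(1, min(num_views, budget))
    # Spend the remaining budget on frames, capped by the clip length.
    t = min(num_frames, budget // v)
    # Random subset of camera views, returned in ascending order.
    views = sorted(rng.sample(range(num_views), v))
    # Random contiguous temporal window of length t.
    start = rng.randint(0, num_frames - t)
    frames = list(range(start, start + t))
    return views, frames


if __name__ == "__main__":
    views, frames = sample_view_frame_config(rng=random.Random(0))
    print("views:", views)
    print("frames:", frames)
```

Under this assumed scheme a training step sees either many views over a short window or few views over a long one, which is one way a model could be exposed to both multi-view and temporal structure without ever processing the full view-frame grid.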

Paper Structure

This paper contains 37 sections, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Generated samples conditioned on a single-view video and text prompt. Both single-view video inputs were generated using existing video generation models (Sora, Runway). Autoregressive generation is applied to extend the video length.
  • Figure 2: We divide the equi-rectangular video into 8 perspective views via projection. Our diffusion model consists of interleaved spatial, multi-view, and temporal blocks, conditioned on text prompts. Attention is used to propagate information through the multi-view videos to ensure consistency. The input views are embedded using the ray directions as visualized by the color map behind the perspective images.
  • Figure 3: The model is trained using three frame conditioning regimes: (a) no image conditions, with initial inputs of pure noise; (b) conditioning only on the first view of the video; (c) conditioning on the first frame and the first view for autoregressive video generation. At inference time, we generate long videos autoregressively by using conditioning (b) to generate the first window and subsequently using the last row of multi-view images from the previous time step (the shaded region) as the first-row input to our model with conditioning (c).
  • Figure 4: Qualitative comparison of text-conditional video generation, 360DVD vs. ours. The pixel quality of 360DVD is lower and its distortion near the poles (top and bottom) is worse.
  • Figure 5: Qualitative comparison of video-conditional generation, MVDiffusion vs. ours. Note that MVDiffusion can only outpaint each frame of the video separately. MVDiffusion is worse than ours at maintaining the structure and style of the input view globally. For example, the sky color and the scales and depths of objects are less consistent for MVDiffusion.
  • ...and 10 more figures
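
The projection described in the Figure 2 caption (perspective views cut from an equirectangular video, with each view tagged by its ray directions) can be sketched roughly as follows. The caption states only that there are 8 perspective views; the evenly spaced yaw angles, the 90-degree field of view, the zero elevation, the axis convention, and the helper names `pixel_ray` and `ray_to_equirect` are all assumptions made for illustration.

```python
import math

# Assumed layout: 8 views evenly spaced in yaw at zero elevation.
VIEW_YAWS_DEG = [i * 45.0 for i in range(8)]


def pixel_ray(u, v, width, height, fov_deg, yaw_deg):
    """Unit ray direction for pixel (u, v) of a pinhole view.

    Camera frame (assumed): x right, y up, z forward. The view is
    rotated about the up axis by `yaw_deg`.
    """
    # Focal length in pixels from the horizontal field of view.
    f = 0.5 * width / math.tan(math.radians(fov_deg) / 2.0)
    x = (u + 0.5) - width / 2.0
    y = height / 2.0 - (v + 0.5)
    z = f
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n
    # Rotate about the y (up) axis by the view's yaw.
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return (c * x + s * z, y, -s * x + c * z)


def ray_to_equirect(d, pano_w, pano_h):
    """Map a unit ray to continuous equirectangular pixel coordinates."""
    x, y, z = d
    lon = math.atan2(x, z)                    # longitude, 0 = forward
    lat = math.asin(max(-1.0, min(1.0, y)))   # latitude, 0 = horizon
    px = (lon / (2.0 * math.pi) + 0.5) * pano_w
    py = (0.5 - lat / math.pi) * pano_h
    return px, py


if __name__ == "__main__":
    # Where does the center pixel of each assumed view land in the panorama?
    for yaw in VIEW_YAWS_DEG:
        ray = pixel_ray(1, 1, 3, 3, 90.0, yaw)
        print(yaw, ray_to_equirect(ray, 2048, 1024))
```

Sampling the panorama at the coordinates returned by `ray_to_equirect` for every pixel would yield one perspective view, and the per-pixel rays from `pixel_ray` are the kind of quantity a ray-direction embedding could encode.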