VideoPanda: Video Panoramic Diffusion with Multi-view Attention
Kevin Xie, Amirmojtaba Sabour, Jiahui Huang, Despoina Paschalidou, Greg Klar, Umar Iqbal, Sanja Fidler, Xiaohui Zeng
TL;DR
VideoPanda generates high-resolution $360^\circ$ panoramic videos conditioned on text or single-view video by extending a pretrained video diffusion model with 3D multi-view self-attention and ray-direction embeddings. The model is trained in two stages with randomized view-frame configurations (the "random matrix" strategy) and with multi-task conditioning (text, video, autoregressive). Together with a shifted $v$-prediction objective, this enables autoregressive long-video generation. Across real and synthetic datasets, VideoPanda achieves better realism, temporal coherence, and prompt alignment than baselines such as 360DVD and MVDiffusion, as measured by FID/FVD, PSNR/SSIM/LPIPS, CLIP scores, and user studies. Because the method synthesizes consistent multi-view videos that can be stitched into immersive $360^\circ$ content, it offers a scalable pathway for VR content creation. Future directions include dynamic scene understanding, more accurate estimation of conditioning parameters, and extension to stronger base video models.
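To illustrate the kind of input a ray-direction embedding could consume, the sketch below computes one unit ray direction per pixel for a single perspective view. The pinhole camera model, the axis convention (x right, y up, z forward), the yaw-only rotation, and the function name `ray_directions` are all assumptions made for this example; none of them are taken from the VideoPanda implementation.

```python
import math


def ray_directions(width, height, hfov_deg, yaw_deg):
    """Return a height x width grid of unit ray directions (x, y, z).

    Assumed convention (hypothetical, for illustration only):
    x points right, y points up, z points forward, and the view is
    rotated about the vertical (y) axis by yaw_deg.
    """
    # Focal length in pixels from the horizontal field of view.
    focal = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    cos_yaw = math.cos(math.radians(yaw_deg))
    sin_yaw = math.sin(math.radians(yaw_deg))

    rays = []
    for row in range(height):
        ray_row = []
        for col in range(width):
            # Pixel centre relative to the image centre, in camera space.
            x = (col + 0.5) - width / 2
            y = height / 2 - (row + 0.5)
            z = focal
            norm = math.sqrt(x * x + y * y + z * z)
            x, y, z = x / norm, y / norm, z / norm
            # Rotate about the y axis so each view faces its own yaw.
            x_world = cos_yaw * x + sin_yaw * z
            z_world = -sin_yaw * x + cos_yaw * z
            ray_row.append((x_world, y, z_world))
        rays.append(ray_row)
    return rays


# Example: ray grids for four hypothetical views spaced 90 degrees apart.
views = [ray_directions(4, 4, 90.0, yaw) for yaw in (0.0, 90.0, 180.0, 270.0)]
```

Such a per-pixel direction map gives every view a shared description of where it looks on the sphere, which is one plausible way to tell a multi-view model how the views relate to each other.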
Abstract
High-resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing $360^\circ$ videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly using two conditions, text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model generalizes gracefully to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent $360^\circ$ panoramas across all input conditions compared to existing methods. Visit the project website at https://research.nvidia.com/labs/toronto-ai/VideoPanda/ for results.
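The random subsampling of duration and camera views can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `sample_training_subset`, the `budget` cap on views times frames (a stand-in for a memory limit), the uniform sampling, and the contiguous frame window are all assumptions.

```python
import random


def sample_training_subset(total_views, total_frames, budget, rng=None):
    """Pick a random subset of views and a contiguous window of frames.

    `budget` caps len(views) * len(frames); it is a hypothetical proxy
    for the compute or memory limit of one training step.
    """
    if budget < 1:
        raise ValueError("budget must allow at least one view and one frame")
    rng = rng or random.Random()

    # Choose how many views to keep, never more than the budget allows.
    n_views = rng.randint(1, min(total_views, budget))
    # Choose a clip length so that views * frames stays within the budget.
    max_frames = min(total_frames, budget // n_views)
    n_frames = rng.randint(1, max_frames)

    # Random subset of view indices, and a random contiguous frame window.
    views = sorted(rng.sample(range(total_views), n_views))
    start = rng.randint(0, total_frames - n_frames)
    frames = list(range(start, start + n_frames))
    return views, frames


# Example: one training configuration for 8 views and 16 frames.
views, frames = sample_training_subset(8, 16, budget=48, rng=random.Random(0))
```

Under this scheme some steps see many views over a short clip and others see few views over a long clip, so the full views-by-frames grid is never needed in a single training step.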
