Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos

Rundong Luo, Matthew Wallingford, Ali Farhadi, Noah Snavely, Wei-Chiu Ma

TL;DR

Argus tackles the challenge of generating realistic and temporally coherent 360° panoramic videos from single-view perspective inputs by formulating a diffusion-based video-to-360° framework conditioned on projected equirectangular representations. Key innovations include view-based frame alignment, camera motion simulation, and blended decoding, all trained on a large curated 360° video dataset with a height-weighted score-matching objective. The approach demonstrates superior spatial coherence, temporal stability, and geometric plausibility versus adapted baselines, enabling practical applications such as video stabilization, dynamic viewpoint control, environment mapping, and interactive visual question answering. By leveraging abundant 360° priors and geometry-aware learning, Argus advances panoramic video generation for real-world, in-the-wild scenarios.
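The "projected equirectangular representations" used as conditioning can be pictured with a small geometric sketch. The NumPy code below is an illustration only, not the authors' implementation: it places one perspective frame onto an equirectangular canvas given an assumed horizontal field of view and camera yaw/pitch. The function name `perspective_to_equirect`, its parameters, the axis conventions (camera looks along +z, y up), and the nearest-neighbour sampling are all assumptions made for this example.

```python
import numpy as np


def perspective_to_equirect(frame, fov_deg, yaw_deg=0.0, pitch_deg=0.0,
                            out_h=256, out_w=512):
    """Paste a perspective frame onto an equirectangular canvas (hypothetical helper).

    Returns the canvas and a boolean mask of the pixels covered by the frame.
    """
    h, w = frame.shape[:2]
    # Pinhole focal length in pixels from the horizontal field of view.
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2.0)

    # Longitude in [-pi, pi) and latitude in (pi/2, -pi/2) at pixel centers.
    lon = ((np.arange(out_w) + 0.5) / out_w - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(out_h) + 0.5) / out_h) * np.pi
    lon, lat = np.meshgrid(lon, lat)

    # Unit viewing direction for every panorama pixel (x right, y up, z forward).
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

    # Camera-to-world rotation: pitch about x, then yaw about y.
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    r_yaw = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(yaw), 0.0, np.cos(yaw)]])
    r_pitch = np.array([[1.0, 0.0, 0.0],
                        [0.0, np.cos(pitch), np.sin(pitch)],
                        [0.0, -np.sin(pitch), np.cos(pitch)]])
    # Row-vector form of R^T d: world directions expressed in the camera frame.
    cam = dirs @ (r_yaw @ r_pitch)
    x, y, z = cam[..., 0], cam[..., 1], cam[..., 2]

    # Project directions in front of the camera onto the image plane.
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)
    u = f * x / z_safe + w / 2.0
    v = h / 2.0 - f * y / z_safe
    mask = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Nearest-neighbour lookup; uncovered panorama pixels stay zero.
    ui = np.clip(np.floor(u).astype(int), 0, w - 1)
    vi = np.clip(np.floor(v).astype(int), 0, h - 1)
    canvas = np.zeros((out_h, out_w) + frame.shape[2:], dtype=frame.dtype)
    canvas[mask] = frame[vi[mask], ui[mask]]
    return canvas, mask
```

Calling the function on each frame with that frame's camera orientation would put all frames into one shared panorama coordinate system, which is the general idea behind the view-based frame alignment mentioned above. How Argus estimates the orientation and field of view is not covered by this sketch.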

Abstract

360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more complete perspective of our surroundings. While existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360° generation: given a perspective video as input, our goal is to generate a full panoramic video that is consistent with the original video. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain spatio-temporal consistency. To address these challenges, we first leverage the abundant 360° videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware operations to facilitate the learning process and improve the quality of 360° video generation. Experimental results demonstrate that our model can generate realistic and coherent 360° videos from in-the-wild perspective video. In addition, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.

Paper Structure

This paper contains 22 sections, 12 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: 360$^\circ$ videos generated by our model, Argus$^\dag$. Starting from an input perspective video with arbitrary camera motion (red box), Argus generates a full 360$^\circ$ panoramic video (visualized as environmental maps), where the red box indicates the input view in the generated frame. The blue, orange, and purple boxes show sampled perspectives from the generated 360$^\circ$ video. Best viewed in Adobe Acrobat Reader for the embedded videos.
  • Figure 2: View-based frame alignment. Given input perspective video frames (first row), we project them onto shared coordinates to ensure a consistent viewing direction (second row). Without alignment, placing all video frames at the center (third row) forces the model to learn varying scene arrangements (e.g., the sky appearing at different heights), complicating the learning process.
  • Figure 3: Blended decoding. We blend the video decoded from the original and 180$^\circ$-rotated latents to ensure boundary consistency. Zoom in to see the artifacts on the bottom-right image.
  • Figure 4: Qualitative comparison with the 360$^\circ$ image generation method PanoDiffusion (videos embedded). The input region is highlighted in red, and the orange and blue regions indicate extracted perspective views. Although PanoDiffusion can generate plausible 360$^\circ$ images from perspective inputs, the generated frames are temporally inconsistent.
  • Figure 5: Qualitative comparison with a state-of-the-art video outpainting method. The input region is highlighted in orange. For each generated 360$^\circ$ frame, four unwrapped perspective views are shown on the right. The video outpainting method struggles to satisfy the 360$^\circ$ panoramic property, and its generation quality declines as it extends farther from the input viewpoint.
  • ...and 12 more figures
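
The blended decoding described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `decode` stands in for any latent-to-video decoder, `toy_decode` is a made-up decoder that deliberately produces a seam, and the triangular blending weights are an assumed choice, since the caption does not specify how the two decodings are mixed.

```python
import numpy as np


def blended_decode(latents, decode):
    """Blend decodings of the original and 180-degree-rotated latents (sketch).

    `latents` has the panorama width as its last axis; `decode` maps latents
    to frames whose last axis is also the (possibly upsampled) width.
    """
    w_lat = latents.shape[-1]
    if w_lat % 2:
        raise ValueError("latent width must be even to rotate by 180 degrees")

    # Decode the latents as they are; any seam sits at the left/right border.
    plain = decode(latents)

    # Rotate by half the width (180 degrees), decode, then rotate back in
    # pixel space. The seam of this decoding sits at the frame center instead.
    rotated = decode(np.roll(latents, w_lat // 2, axis=-1))
    w_pix = rotated.shape[-1]
    unrotated = np.roll(rotated, -(w_pix // 2), axis=-1)

    # Assumed triangular weights: trust `plain` near the center and the
    # rotated decoding near the left/right border.
    t = (np.arange(w_pix) + 0.5) / w_pix
    w_plain = 1.0 - np.abs(2.0 * t - 1.0)
    return w_plain * plain + (1.0 - w_plain) * unrotated


def toy_decode(latents):
    """Made-up decoder: 2x width upsampling plus a position-dependent ramp."""
    ramp = np.linspace(0.0, 1.0, 2 * latents.shape[-1])
    return np.repeat(latents, 2, axis=-1) + ramp


if __name__ == "__main__":
    z = np.zeros((2, 4, 8))  # (frames, height, width) toy latents
    plain = toy_decode(z)
    blended = blended_decode(z, toy_decode)
    print("seam jump, plain:  ", np.abs(plain[..., 0] - plain[..., -1]).max())
    print("seam jump, blended:", np.abs(blended[..., 0] - blended[..., -1]).max())
```

With a decoder that is perfectly shift-equivariant along the width, both decodings agree and the blend changes nothing; the blend only matters when the decoder treats the left and right borders inconsistently, as the toy decoder does.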