VidPanos: Generative Panoramic Videos from Casual Panning Videos

Jingwei Ma, Erika Lu, Roni Paiss, Shiran Zada, Aleksander Holynski, Tali Dekel, Brian Curless, Michael Rubinstein, Forrester Cole

TL;DR

This work presents a method for synthesizing a panoramic video from a casually captured panning video, as if the original video had been captured with a wide-angle camera. It applies video generation as one component of the panorama synthesis system and shows how to exploit the strengths of generative video models while minimizing their limitations.

Abstract

Panoramic image stitching provides a unified, wide-angle view of a scene that extends beyond the camera's field of view. Stitching frames of a panning video into a panoramic photograph is a well-understood problem for stationary scenes, but when objects are moving, a still panorama cannot capture the scene. We present a method for synthesizing a panoramic video from a casually-captured panning video, as if the original video were captured with a wide-angle camera. We pose panorama synthesis as a space-time outpainting problem, where we aim to create a full panoramic video of the same length as the input video. Consistent completion of the space-time volume requires a powerful, realistic prior over video content and motion, for which we adapt generative video models. Existing generative models do not, however, immediately extend to panorama completion, as we show. We instead apply video generation as a component of our panorama synthesis system, and demonstrate how to exploit the strengths of the models while minimizing their limitations. Our system can create video panoramas for a range of in-the-wild scenes including people, vehicles, and flowing water, as well as stationary background features.

Paper Structure

This paper contains 44 sections, 3 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Temporal coarse-to-fine. The input video (a) is projected onto a unified panoramic canvas using estimated camera parameters. The reprojected input video (b) is temporally downsampled with temporal prefiltering. A base panoramic video is synthesized at the coarsest temporal scale (top), then gradually refined by temporal upsampling, merging, and resynthesis (c). Finally, a spatial super-resolution pass is applied and the original input pixels are merged with the result to produce the output video (d).
  • Figure 2: Upsampling and outpainting. The completed panorama from the previous level $\mathbf{y}^{k+1}$ (a) is temporally-upsampled and composited with the current level input video $\mathbf{x}^k$ to form a partially-completed input $\hat{\mathbf{y}}^k_{merge}$ (b, input pixels shown highlighted). The model uses the full $\hat{\mathbf{y}}^k_{merge}$ for context and resynthesizes content outside the input mask to complete the next level panorama $\mathbf{y}^{k}$ (c). In the time dimension, the model is applied in a sliding-window fashion with half-window overlap. In the spatial dimension, multiple overlapping predictions are computed in parallel, then aggregated and a sample is drawn from the average (d).
  • Figure 3: Spatial aggregation of predicted distributions. To generate a sample in the overlap (red), we linearly interpolate the two predicted probability distributions (purple, orange) and sample from the aggregated distribution (brown). With a token-based method the distribution is a discrete distribution over the vocabulary. With diffusion, the distribution is a Gaussian distribution over pixel values, represented by $\mu$ and $\Sigma$.
  • Figure 4: Comparison with baseline methods. From top to bottom: linear interpolation between pixels based on time produces sharp results for stationary regions, but does not interpolate motion. ProPainter [zhou2023propainter] and E$^2$FGVI [liCvpr22vInpainting] are flow-based methods that can produce realistic results in stationary regions (scuba, Bangkok), but fail for moving cameras (skate, ski) or moving objects away from the input window (divers on left in scuba). MAGVIT [yu2023magvit] is a video-generation method but does not generate on a common panorama canvas, so it loses information away from the input window. Our results use a coarse-to-fine approach to build a consistent panoramic video and better match the ground-truth. Bottom: ground truth video with input window marked in yellow. See supplemental material for video results.
  • Figure 5: Comparison with Panoramic Video Textures [agarwala2005panoramic]. PVT uses a graph-cut formulation to create a looping panoramic video. Our method can create similar videos, but can also include non-stationary features like the person walking behind the waterfall (boxed).
  • ...and 8 more figures
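
The captions for Figures 2 and 3 describe two mechanisms that a small sketch may make easier to picture: temporal windows with half-window overlap, and sampling from a linear interpolation of two predicted distributions in a spatial overlap. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names, the handling of the final window, and the single `weight_b` blend parameter are all hypothetical, and only the token-based (discrete) case is shown.

```python
import random


def half_overlap_windows(n_frames, window):
    """Return (start, end) frame ranges that overlap by half a window.

    Assumption: if the stride does not land on the last frame, one extra
    window is placed flush with the end of the video.
    """
    if window < 2 or window % 2:
        raise ValueError("window must be an even number >= 2")
    if n_frames <= window:
        return [(0, n_frames)]
    stride = window // 2
    starts = list(range(0, n_frames - window + 1, stride))
    if starts[-1] + window < n_frames:
        starts.append(n_frames - window)
    return [(s, s + window) for s in starts]


def blend_distributions(p_a, p_b, weight_b=0.5):
    """Linearly interpolate two discrete distributions over one vocabulary.

    weight_b = 0.5 gives the plain average; other values are an assumed
    way to favor one of the two overlapping predictions.
    """
    if len(p_a) != len(p_b):
        raise ValueError("distributions must share a vocabulary")
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must lie in [0, 1]")
    return [(1.0 - weight_b) * a + weight_b * b for a, b in zip(p_a, p_b)]


def sample_token(probs, rng):
    """Draw one token index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    print(half_overlap_windows(10, 4))
    mixed = blend_distributions([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
    print(mixed, sample_token(mixed, random.Random(0)))
```

Because the interpolation is a convex combination, the blended values still sum to one, so a token can be sampled from them directly. The diffusion case in Figure 3 would interpolate Gaussian parameters instead and is not covered by this sketch.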