DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

Jinxiu Liu, Shaoheng Lin, Yinxiao Li, Ming-Hsuan Yang

TL;DR

DynamicScaler tackles the challenge of generating coherent, long-duration panoramic videos with arbitrary resolutions and aspect ratios without model fine-tuning. It introduces Offset Shifting Denoising to evenly denoise a full panorama via shifting windows, and Global Motion Guidance to preserve global motion structure while refining local details through a hierarchical upsampling path. A Panoramic Projection Denoiser enables efficient 360° FoV outputs by projecting ERP latents into perspective viewports for denoising, with spherical mappings to maintain geometric fidelity. Temporal extension further yields long-duration and loopable panoramic videos, overcoming memory constraints of standard diffusion models. Overall, the method demonstrates superior visual quality, motion coherence, and scalability for immersive AR/VR content generation.

Abstract

The increasing demand for immersive AR/VR applications and spatial intelligence has heightened the need to generate high-quality scene-level and 360$°$ panoramic video. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose $\textbf{DynamicScaler}$, which addresses these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce an Offset Shifting Denoiser, which enables efficient, synchronous, and coherent denoising of panoramic dynamic scenes with a fixed-resolution diffusion model through a seamlessly rotating window. This ensures seamless boundary transitions and consistency across the entire panoramic space while accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate that our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with constant VRAM consumption regardless of the output video resolution. The project page is available at $\href{https://dynamic-scaler.pages.dev/new}{https://dynamic-scaler.pages.dev/new}$.
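
The rotating-window idea in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the function names `window_columns` and `offsets_for_steps`, the non-overlapping tiling, and the fixed per-step `stride` are all assumptions made for illustration. It only shows how a fixed-size window can tile a wider panorama latent with horizontal wraparound, and how changing the offset between denoising steps moves the window boundaries.

```python
def window_columns(width, window, offset):
    """Tile `width` latent columns with fixed-size windows, wrapping around.

    Hypothetical helper: each inner list holds the column indices that one
    fixed-resolution denoising window would cover for the given offset.
    """
    if width % window != 0:
        raise ValueError("this sketch assumes width is a multiple of window")
    shift = offset % width
    return [
        [(shift + start + i) % width for i in range(window)]
        for start in range(0, width, window)
    ]


def offsets_for_steps(num_steps, window, stride):
    """Assumed schedule: advance the offset by `stride` columns per step."""
    return [(t * stride) % window for t in range(num_steps)]


# Window boundaries differ from step to step, so a seam at one step
# falls inside a window at the next step.
for step, offset in enumerate(offsets_for_steps(num_steps=3, window=4, stride=2)):
    print(step, window_columns(width=12, window=4, offset=offset))
```

In this sketch every column is covered exactly once per step, so the number of windows grows with the panorama width while the size of each window, and hence the input seen by the fixed-resolution model, stays the same.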

Paper Structure

This paper contains 18 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 2: Our pipeline is divided into two stages. The low-resolution stage establishes a coarse motion structure: the 360-degree setting (the yellow block) uses Panoramic Projecting Denoise to initialize motion that fits a spherical panorama, while the regular perspective setting (the blue block) uses Offset Shifting with overlap for the early denoising steps, after which the remaining denoising steps are completed by our Offset Shifting Denoise. The up-scaling stage (the green block) uses more shifted windows to produce a refined, high-resolution panorama with Global Motion Guidance from the low-resolution video.
  • Figure 3: The proposed Offset Shifting Window mechanism, which shifts the denoising windows both vertically and horizontally between denoising steps to denoise the whole panorama video latent with arbitrary aspect ratio and resolution. Shifting the windows at every step creates "overlap" regions between steps, which mitigate artifacts and synchronize the denoising process across the panorama. This results in seamless and consistent panoramic video generation at high resolution and with arbitrary aspect ratio.
  • Figure 4: The proposed Panoramic Projecting Denoise, in which spherical panorama video latents (represented as equirectangular projections) are projected into perspective viewport windows and denoised, then re-projected back to the equirectangular panorama, as shown in the upper part of the figure. These viewport windows are also shifted by an offset applied to their view angles at each step, as shown in the lower part of the figure. For legibility, only a subset of the viewport regions is shown in the figure.
  • Figure 5: The Offset Shifting Denoising mechanism extended to the temporal dimension. The upper part shows how the frame clip windows are shifted by an offset along the temporal dimension, with padding windows at the start and end of the video sequence. The lower part shows the loopable offset shifting denoising, with looping frames at the start and end of the frame sequence.
  • Figure 6: Example frames (the 1st, 16th, 32nd, 48th, 64th, and 80th) from a video generated by a diffusion model that is originally capable of generating only 16 frames. Despite the increasing video length, the visual quality of the panorama remains consistent, demonstrating the effectiveness of our method in generating long videos.
  • ...and 2 more figures
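
The temporal windowing described in the Figure 5 caption can be sketched as follows. This is a hedged illustration only: the function name `temporal_windows`, the use of `None` to mark padding positions, and the requirement that a loopable video length be a multiple of the clip length are assumptions, not details from the paper.

```python
def temporal_windows(num_frames, clip_len, offset, loop=False):
    """Split frame indices into clips of `clip_len`, shifted by `offset`.

    Hypothetical helper. Non-loop mode marks positions that fall outside
    the video as None (padding); loop mode wraps them around instead.
    """
    offset %= clip_len
    if loop:
        if num_frames % clip_len != 0:
            raise ValueError("this sketch assumes num_frames is a multiple of clip_len")
        return [
            [(start - offset + i) % num_frames for i in range(clip_len)]
            for start in range(0, num_frames, clip_len)
        ]
    windows = []
    start = -offset
    while start < num_frames:
        windows.append([
            f if 0 <= f < num_frames else None
            for f in range(start, start + clip_len)
        ])
        start += clip_len
    return windows


# Padded clips at both ends of a non-looping video.
print(temporal_windows(num_frames=8, clip_len=4, offset=2))
# One clip spans the last and first frames of a looping video.
print(temporal_windows(num_frames=8, clip_len=4, offset=2, loop=True))
```

In loop mode, the clip that spans the last and first frames is what would let a model limited to `clip_len` frames treat the end and start of the sequence as neighbours, which is the property a loopable video needs.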